orders¶
- order_id (PK): unique identifier for the order; primary key of this table
- customer_id: unique identifier for the customer
- order_status: status of the order, e.g. delivered, canceled, processing
- order_purchase_timestamp: timestamp of when the customer placed the order
- order_approved_at: timestamp of when the seller approved the order
- order_delivered_timestamp: timestamp of when the order was delivered to the customer's location
- order_estimated_delivery_date: estimated delivery date shown to the customer at purchase time
order_items¶
- order_id (PK): unique identifier for the order
- order_item_id (PK): item number within each order; together with order_id it forms this table's primary key
- product_id: unique identifier for the product
- seller_id: unique identifier for the seller
- price: sale price of the product
- shipping_charges: cost of shipping the product
customers¶
- customer_id (PK): unique identifier for the customer; primary key of this table
- customer_zip_code_prefix: customer's zip code
- customer_city: customer's city
- customer_state: customer's state
payments¶
- order_id: unique identifier for the order; this column may repeat within this table
- payment_sequential: sequence number of the payment within a given order
- payment_type: payment method, e.g. credit card, debit card
- payment_installments: number of installments for credit-card payments
- payment_value: transaction amount
products¶
- product_id: unique identifier for each product; primary key of this table
- product_category_name: category the product belongs to
- product_weight_g: product weight (grams)
- product_length_cm: product length (centimeters)
- product_height_cm: product height (centimeters)
- product_width_cm: product width (centimeters)
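The key constraints described above can be checked mechanically after loading. A minimal sketch on toy frames (the column names follow the schema above; the values are made up):

```python
import pandas as pd

# Toy stand-in for orders.csv; the real file is loaded in section 1-1
orders = pd.DataFrame({
    "order_id": ["a1", "a2", "a3"],
    "order_status": ["delivered", "canceled", "delivered"],
})

# A primary key must be unique and non-null
assert orders["order_id"].is_unique
assert orders["order_id"].notna().all()

# order_items uses a composite key: (order_id, order_item_id)
order_items = pd.DataFrame({
    "order_id": ["a1", "a1", "a2"],
    "order_item_id": [1, 2, 1],
})
assert not order_items.duplicated(subset=["order_id", "order_item_id"]).any()
```

Run against the real files, the same checks would surface duplicate rows such as the ones removed from customers_df with drop_duplicates() below.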
1-1. Data load¶
- merge_df: a single dataset produced by merging all of the tables above
# All required modules are managed in this cell
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
# Load the data (pandas is already imported above)
customers_df = pd.read_csv("D:/bootcamp/machine_learning/project/customers.csv") #99441
order_items_df = pd.read_csv("D:/bootcamp/machine_learning/project/order_items.csv") #112650
orders_df = pd.read_csv("D:/bootcamp/machine_learning/project/orders.csv") #99441
payments_df = pd.read_csv("D:/bootcamp/machine_learning/project/payments.csv") #103886
products_df = pd.read_csv("D:/bootcamp/machine_learning/project/products.csv") #32951
customers_df = customers_df.drop_duplicates()
# Merge orders_df and customers_df on customer_id
merged_df = pd.merge(orders_df, customers_df, on='customer_id', how='left')
# Merge merged_df and order_items_df on order_id
merged_df = pd.merge(merged_df, order_items_df, on='order_id', how='left')
# Merge merged_df and payments_df on order_id
merged_df = pd.merge(merged_df, payments_df, on='order_id', how='left')
# Merge merged_df and products_df on product_id
final_df = pd.merge(merged_df, products_df, on='product_id', how='left')
# Save the result to a CSV file
# final_df.to_csv("merged_data_final.csv", index=False)
merge_df = final_df
merge_df.head(5)
| order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_timestamp | order_estimated_delivery_date | customer_zip_code_prefix | customer_city | customer_state | ... | shipping_charges | payment_sequential | payment_type | payment_installments | payment_value | product_category_name | product_weight_g | product_length_cm | product_height_cm | product_width_cm | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 7c396fd4830fd04220f754e42b4e5bff | delivered | 10/2/2017 10:56 | 10/2/2017 11:07 | 10/10/2017 21:25 | 10/18/2017 0:00 | 3149 | sao paulo | SP | ... | 8.72 | 1.0 | credit_card | 1.0 | 18.12 | housewares | 500.0 | 19.0 | 8.0 | 13.0 |
| 1 | e481f51cbdc54678b7cc49136f2d6af7 | 7c396fd4830fd04220f754e42b4e5bff | delivered | 10/2/2017 10:56 | 10/2/2017 11:07 | 10/10/2017 21:25 | 10/18/2017 0:00 | 3149 | sao paulo | SP | ... | 8.72 | 3.0 | voucher | 1.0 | 2.00 | housewares | 500.0 | 19.0 | 8.0 | 13.0 |
| 2 | e481f51cbdc54678b7cc49136f2d6af7 | 7c396fd4830fd04220f754e42b4e5bff | delivered | 10/2/2017 10:56 | 10/2/2017 11:07 | 10/10/2017 21:25 | 10/18/2017 0:00 | 3149 | sao paulo | SP | ... | 8.72 | 2.0 | voucher | 1.0 | 18.59 | housewares | 500.0 | 19.0 | 8.0 | 13.0 |
| 3 | 53cdb2fc8bc7dce0b6741e2150273451 | af07308b275d755c9edb36a90c618231 | delivered | 7/24/2018 20:41 | 7/26/2018 3:24 | 8/7/2018 15:27 | 8/13/2018 0:00 | 47813 | barreiras | BA | ... | 22.76 | 1.0 | wallet | 1.0 | 141.46 | toys | 400.0 | 19.0 | 13.0 | 19.0 |
| 4 | 47770eb9100c2d0c44946d9cf07ec65d | 3a653a41f6f9fc3d2a113cf8398680e8 | delivered | 8/8/2018 8:38 | 8/8/2018 8:55 | 8/17/2018 18:06 | 9/4/2018 0:00 | 75265 | vianopolis | GO | ... | 19.22 | 1.0 | credit_card | 3.0 | 179.12 | toys | 420.0 | 24.0 | 19.0 | 21.0 |
5 rows × 24 columns
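Because payments and order_items are one-to-many on order_id, each left merge above can fan rows out (99,441 orders become 119,160 rows). A small sketch on toy data of how pd.merge's `validate` and `indicator` arguments make that expectation explicit:

```python
import pandas as pd

orders = pd.DataFrame({"order_id": ["a1", "a2"], "customer_id": ["c1", "c2"]})
payments = pd.DataFrame({"order_id": ["a1", "a1", "a2"],
                         "payment_value": [10.0, 5.0, 20.0]})

# validate="one_to_many" raises MergeError if order_id is not unique on the left;
# indicator=True adds a _merge column showing which side each row came from
merged = pd.merge(orders, payments, on="order_id",
                  how="left", validate="one_to_many", indicator=True)

assert len(merged) == 3                    # a1 fans out into two rows
assert (merged["_merge"] == "both").all()  # every order found at least one payment
```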
2. Exploratory Data Analysis(EDA)¶
2-1. A quick look at the data¶
# Summary of the data: info()
merge_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119160 entries, 0 to 119159
Data columns (total 24 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   order_id                       119160 non-null  object 
 1   customer_id                    119160 non-null  object 
 2   order_status                   119160 non-null  object 
 3   order_purchase_timestamp       119160 non-null  object 
 4   order_approved_at              118982 non-null  object 
 5   order_delivered_timestamp      115738 non-null  object 
 6   order_estimated_delivery_date  119160 non-null  object 
 7   customer_zip_code_prefix       119160 non-null  int64  
 8   customer_city                  119160 non-null  object 
 9   customer_state                 119160 non-null  object 
 10  order_item_id                  118325 non-null  float64
 11  product_id                     118325 non-null  object 
 12  seller_id                      118325 non-null  object 
 13  price                          118325 non-null  float64
 14  shipping_charges               118325 non-null  float64
 15  payment_sequential             119157 non-null  float64
 16  payment_type                   119157 non-null  object 
 17  payment_installments           119157 non-null  float64
 18  payment_value                  119157 non-null  float64
 19  product_category_name          117893 non-null  object 
 20  product_weight_g               118305 non-null  float64
 21  product_length_cm              118305 non-null  float64
 22  product_height_cm              118305 non-null  float64
 23  product_width_cm               118305 non-null  float64
dtypes: float64(10), int64(1), object(13)
memory usage: 21.8+ MB
# Descriptive statistics: describe()
merge_df.describe(include='all')
| order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_timestamp | order_estimated_delivery_date | customer_zip_code_prefix | customer_city | customer_state | ... | shipping_charges | payment_sequential | payment_type | payment_installments | payment_value | product_category_name | product_weight_g | product_length_cm | product_height_cm | product_width_cm | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 119160 | 119160 | 119160 | 119160 | 118982 | 115738 | 119160 | 119160.000000 | 119160 | 119160 | ... | 118325.000000 | 119157.000000 | 119157 | 119157.000000 | 119157.000000 | 117893 | 118305.000000 | 118305.000000 | 118305.000000 | 118305.000000 |
| unique | 99441 | 96096 | 8 | 88789 | 50462 | 75649 | 459 | NaN | 4119 | 27 | ... | NaN | NaN | 5 | NaN | NaN | 70 | NaN | NaN | NaN | NaN |
| top | 895ab968e7bb0d5659d16cd74cd1650c | 9a736b248f67d166d2fbb006bcb877c3 | delivered | 8/8/2017 20:26 | 1/10/2018 10:32 | 8/14/2017 12:46 | 12/20/2017 0:00 | NaN | sao paulo | SP | ... | NaN | NaN | credit_card | NaN | NaN | toys | NaN | NaN | NaN | NaN |
| freq | 63 | 75 | 115739 | 63 | 121 | 63 | 658 | NaN | 18864 | 50305 | ... | NaN | NaN | 87795 | NaN | NaN | 88791 | NaN | NaN | NaN | NaN |
| mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 35000.842204 | NaN | NaN | ... | 20.037466 | 1.094178 | NaN | 2.941053 | 172.825878 | NaN | 2113.957364 | 30.254537 | 16.633143 | 23.066692 |
| std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 29806.174149 | NaN | NaN | ... | 15.847918 | 0.726675 | NaN | 2.777111 | 267.813648 | NaN | 3789.273735 | 16.189991 | 13.453618 | 11.743808 |
| min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1003.000000 | NaN | NaN | ... | 0.000000 | 1.000000 | NaN | 0.000000 | 0.000000 | NaN | 0.000000 | 7.000000 | 2.000000 | 6.000000 |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 11250.000000 | NaN | NaN | ... | 13.080000 | 1.000000 | NaN | 1.000000 | 60.890000 | NaN | 300.000000 | 18.000000 | 8.000000 | 15.000000 |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 24240.000000 | NaN | NaN | ... | 16.280000 | 1.000000 | NaN | 2.000000 | 108.210000 | NaN | 700.000000 | 25.000000 | 13.000000 | 20.000000 |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 58415.000000 | NaN | NaN | ... | 21.190000 | 1.000000 | NaN | 4.000000 | 189.260000 | NaN | 1800.000000 | 38.000000 | 20.000000 | 30.000000 |
| max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 99990.000000 | NaN | NaN | ... | 409.680000 | 29.000000 | NaN | 24.000000 | 13664.080000 | NaN | 40425.000000 | 105.000000 | 105.000000 | 118.000000 |
11 rows × 24 columns
# Check for missing values
merge_df.isna().sum()
order_id                            0
customer_id                         0
order_status                        0
order_purchase_timestamp            0
order_approved_at                 178
order_delivered_timestamp        3422
order_estimated_delivery_date       0
customer_zip_code_prefix            0
customer_city                       0
customer_state                      0
order_item_id                     835
product_id                        835
seller_id                         835
price                             835
shipping_charges                  835
payment_sequential                  3
payment_type                        3
payment_installments                3
payment_value                       3
product_category_name            1267
product_weight_g                  855
product_length_cm                 855
product_height_cm                 855
product_width_cm                  855
dtype: int64
# Columns containing missing values
missing_columns = merge_df.isna().sum()[merge_df.isna().sum() > 0].index
missing_columns
Index(['order_approved_at', 'order_delivered_timestamp', 'order_item_id',
'product_id', 'seller_id', 'price', 'shipping_charges',
'payment_sequential', 'payment_type', 'payment_installments',
'payment_value', 'product_category_name', 'product_weight_g',
'product_length_cm', 'product_height_cm', 'product_width_cm'],
dtype='object')
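Raw counts hide scale; the same check expressed as a share of rows makes the later drop-vs-impute decisions easier to justify. A self-contained sketch on a toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps, standing in for merge_df
df = pd.DataFrame({
    "order_approved_at": ["1/1/2018 9:00", None, "1/3/2018 9:00", "1/4/2018 9:00"],
    "price": [10.0, 20.0, np.nan, 40.0],
    "order_id": ["a1", "a2", "a3", "a4"],
})

# isna().mean() gives the fraction of missing values per column
missing_ratio = df.isna().mean()
missing_ratio = missing_ratio[missing_ratio > 0].sort_values(ascending=False)
print(missing_ratio)  # each gappy column here is 25% missing
```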
Examining the data split into numerical and categorical variables
merge_df.iloc[:, 0:13].head(3)
| order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_timestamp | order_estimated_delivery_date | customer_zip_code_prefix | customer_city | customer_state | order_item_id | product_id | seller_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 7c396fd4830fd04220f754e42b4e5bff | delivered | 10/2/2017 10:56 | 10/2/2017 11:07 | 10/10/2017 21:25 | 10/18/2017 0:00 | 3149 | sao paulo | SP | 1.0 | 87285b34884572647811a353c7ac498a | 3504c0cb71d7fa48d967e0e4c94d59d9 |
| 1 | e481f51cbdc54678b7cc49136f2d6af7 | 7c396fd4830fd04220f754e42b4e5bff | delivered | 10/2/2017 10:56 | 10/2/2017 11:07 | 10/10/2017 21:25 | 10/18/2017 0:00 | 3149 | sao paulo | SP | 1.0 | 87285b34884572647811a353c7ac498a | 3504c0cb71d7fa48d967e0e4c94d59d9 |
| 2 | e481f51cbdc54678b7cc49136f2d6af7 | 7c396fd4830fd04220f754e42b4e5bff | delivered | 10/2/2017 10:56 | 10/2/2017 11:07 | 10/10/2017 21:25 | 10/18/2017 0:00 | 3149 | sao paulo | SP | 1.0 | 87285b34884572647811a353c7ac498a | 3504c0cb71d7fa48d967e0e4c94d59d9 |
merge_df.iloc[:, 13:].head(3)
| price | shipping_charges | payment_sequential | payment_type | payment_installments | payment_value | product_category_name | product_weight_g | product_length_cm | product_height_cm | product_width_cm | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29.99 | 8.72 | 1.0 | credit_card | 1.0 | 18.12 | housewares | 500.0 | 19.0 | 8.0 | 13.0 |
| 1 | 29.99 | 8.72 | 3.0 | voucher | 1.0 | 2.00 | housewares | 500.0 | 19.0 | 8.0 | 13.0 |
| 2 | 29.99 | 8.72 | 2.0 | voucher | 1.0 | 18.59 | housewares | 500.0 | 19.0 | 8.0 | 13.0 |
Numerical:
- order_item_id
- price
- shipping_charges
- payment_sequential: somewhat ambiguous, but treated as numerical for now
- payment_installments
- payment_value
- product_weight_g, product_length_cm, product_height_cm, product_width_cm
Categorical:
- order_status
- order_purchase_timestamp, order_approved_at, order_delivered_timestamp, order_estimated_delivery_date: these could later be bucketed by quarter or month
- customer_zip_code_prefix, customer_city, customer_state
- payment_type
- product_category_name
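The month/quarter bucketing suggested above can be sketched like this; the `M/D/YYYY H:MM` format string matches the samples shown in the tables, and the values are toy data:

```python
import pandas as pd

ts = pd.Series(["10/2/2017 10:56", "7/24/2018 20:41", "8/8/2018 8:38"])
purchase = pd.to_datetime(ts, format="%m/%d/%Y %H:%M")

# Derive coarser categorical buckets from the raw timestamps
month = purchase.dt.to_period("M").astype(str)
quarter = purchase.dt.to_period("Q").astype(str)
print(month.tolist())    # ['2017-10', '2018-07', '2018-08']
print(quarter.tolist())  # ['2017Q4', '2018Q3', '2018Q3']
```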
numerical_cols = ['order_item_id', 'price', 'shipping_charges', 'payment_sequential', 'payment_installments', 'payment_value',
'product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm']
categorical_cols = ['order_status', 'order_purchase_timestamp', 'order_approved_at', 'order_delivered_timestamp',
'order_estimated_delivery_date', 'customer_zip_code_prefix', 'customer_city', 'customer_state',
'payment_type', 'product_category_name']
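The manual split can be cross-checked against the dtypes with `select_dtypes`; a sketch on a toy frame (note the timestamp columns are stored as object here, so they would land on the non-numeric side either way):

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 20.0],
    "payment_type": ["credit_card", "voucher"],
    "order_item_id": [1.0, 2.0],
})

# Split columns by whether their dtype is numeric
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()
print(num_cols)  # ['price', 'order_item_id']
print(cat_cols)  # ['payment_type']
```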
# value_counts for the numerical variables
numerical_df = merge_df[numerical_cols]
for col in numerical_df.columns:
    print(numerical_df[col].value_counts().to_frame().join(numerical_df[col].value_counts(normalize=True).to_frame().cumsum()))
    print("=" * 50)
count proportion
order_item_id
1.0 103656 0.876028
2.0 10314 0.963195
3.0 2403 0.983503
4.0 995 0.991912
5.0 473 0.995910
6.0 264 0.998141
7.0 62 0.998665
8.0 37 0.998977
9.0 29 0.999222
10.0 26 0.999442
11.0 17 0.999586
12.0 13 0.999696
13.0 8 0.999763
14.0 7 0.999823
15.0 5 0.999865
16.0 3 0.999890
17.0 3 0.999915
18.0 3 0.999941
19.0 3 0.999966
20.0 3 0.999992
21.0 1 1.000000
==================================================
count proportion
price
59.90 2628 0.022210
69.90 2116 0.040093
49.90 2049 0.057410
89.90 1632 0.071202
99.90 1531 0.084141
... ... ...
424.90 1 0.999966
234.80 1 0.999975
119.95 1 0.999983
107.94 1 0.999992
213.39 1 1.000000
[5968 rows x 2 columns]
==================================================
count proportion
shipping_charges
15.10 3856 0.032588
7.78 2353 0.052474
14.10 1989 0.069284
11.85 1982 0.086034
18.23 1631 0.099818
... ... ...
39.37 1 0.999966
49.03 1 0.999975
40.72 1 0.999983
48.10 1 0.999992
36.89 1 1.000000
[6999 rows x 2 columns]
==================================================
count proportion
payment_sequential
1.0 114011 0.956813
2.0 3430 0.985599
3.0 662 0.991155
4.0 320 0.993840
5.0 192 0.995451
6.0 134 0.996576
7.0 92 0.997348
8.0 61 0.997860
9.0 50 0.998280
10.0 41 0.998624
11.0 35 0.998917
12.0 27 0.999144
13.0 16 0.999278
14.0 13 0.999387
15.0 11 0.999480
16.0 9 0.999555
18.0 9 0.999631
17.0 9 0.999706
19.0 9 0.999782
21.0 6 0.999832
20.0 6 0.999883
22.0 3 0.999908
25.0 2 0.999924
26.0 2 0.999941
23.0 2 0.999958
24.0 2 0.999975
27.0 1 0.999983
29.0 1 0.999992
28.0 1 1.000000
==================================================
count proportion
payment_installments
1.0 59438 0.498821
2.0 13865 0.615180
3.0 11883 0.714906
4.0 8073 0.782656
10.0 6975 0.841193
5.0 6102 0.892402
8.0 5109 0.935279
6.0 4670 0.974471
7.0 1857 0.990055
9.0 747 0.996324
12.0 166 0.997717
15.0 94 0.998506
18.0 38 0.998825
24.0 34 0.999110
11.0 26 0.999329
20.0 21 0.999505
13.0 18 0.999656
14.0 16 0.999790
17.0 8 0.999857
16.0 7 0.999916
21.0 5 0.999958
0.0 3 0.999983
23.0 1 0.999992
22.0 1 1.000000
==================================================
count proportion
payment_value
50.00 351 0.002946
100.00 302 0.005480
20.00 289 0.007906
77.57 251 0.010012
35.00 167 0.011414
... ... ...
202.55 1 0.999966
268.97 1 0.999975
381.76 1 0.999983
202.53 1 0.999992
182.39 1 1.000000
[29077 rows x 2 columns]
==================================================
count proportion
product_weight_g
200.0 7092 0.059947
150.0 5414 0.105710
250.0 4727 0.145666
300.0 4444 0.183230
400.0 3780 0.215181
... ... ...
726.0 1 0.999966
8575.0 1 0.999975
3598.0 1 0.999983
11025.0 1 0.999992
2676.0 1 1.000000
[2204 rows x 2 columns]
==================================================
count proportion
product_length_cm
16.0 18420 0.155699
20.0 10976 0.248476
30.0 7954 0.315709
17.0 6208 0.368184
18.0 5902 0.418072
... ... ...
83.0 8 0.999831
96.0 8 0.999899
94.0 6 0.999949
9.0 4 0.999983
8.0 2 1.000000
[99 rows x 2 columns]
==================================================
count proportion
product_height_cm
10.0 10375 0.087697
20.0 6935 0.146317
15.0 6880 0.204471
12.0 6533 0.259693
11.0 6427 0.314019
... ... ...
98.0 3 0.999932
92.0 3 0.999958
94.0 2 0.999975
97.0 2 0.999992
89.0 1 1.000000
[102 rows x 2 columns]
==================================================
count proportion
product_width_cm
20.0 12727 0.107578
11.0 11140 0.201741
15.0 9373 0.280969
16.0 8859 0.355851
30.0 8046 0.423862
... ... ...
103.0 1 0.999966
97.0 1 0.999975
104.0 1 0.999983
98.0 1 0.999992
86.0 1 1.000000
[95 rows x 2 columns]
==================================================
# Descriptive statistics for the numerical variables
numerical_df.describe()
| order_item_id | price | shipping_charges | payment_sequential | payment_installments | payment_value | product_weight_g | product_length_cm | product_height_cm | product_width_cm | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 118325.000000 | 118325.000000 | 118325.000000 | 119157.000000 | 119157.000000 | 119157.000000 | 118305.000000 | 118305.000000 | 118305.000000 | 118305.000000 |
| mean | 1.196569 | 120.729200 | 20.037466 | 1.094178 | 2.941053 | 172.825878 | 2113.957364 | 30.254537 | 16.633143 | 23.066692 |
| std | 0.699144 | 184.130919 | 15.847918 | 0.726675 | 2.777111 | 267.813648 | 3789.273735 | 16.189991 | 13.453618 | 11.743808 |
| min | 1.000000 | 0.850000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 7.000000 | 2.000000 | 6.000000 |
| 25% | 1.000000 | 39.900000 | 13.080000 | 1.000000 | 1.000000 | 60.890000 | 300.000000 | 18.000000 | 8.000000 | 15.000000 |
| 50% | 1.000000 | 74.900000 | 16.280000 | 1.000000 | 2.000000 | 108.210000 | 700.000000 | 25.000000 | 13.000000 | 20.000000 |
| 75% | 1.000000 | 134.900000 | 21.190000 | 1.000000 | 4.000000 | 189.260000 | 1800.000000 | 38.000000 | 20.000000 | 30.000000 |
| max | 21.000000 | 6735.000000 | 409.680000 | 29.000000 | 24.000000 | 13664.080000 | 40425.000000 | 105.000000 | 105.000000 | 118.000000 |
# value_counts for the categorical variables
categorical_df = merge_df[categorical_cols]
for col in categorical_df.columns:
    print(categorical_df[col].value_counts().to_frame().join(categorical_df[col].value_counts(normalize=True).to_frame().cumsum()))
    print("=" * 50)
count proportion
order_status
delivered 115739 0.971291
shipped 1257 0.981840
canceled 752 0.988150
unavailable 653 0.993630
processing 376 0.996786
invoiced 375 0.999933
created 5 0.999975
approved 3 1.000000
==================================================
count proportion
order_purchase_timestamp
8/8/2017 20:26 63 0.000529
9/23/2017 14:56 38 0.000848
8/2/2018 12:06 35 0.001141
8/2/2018 12:05 31 0.001401
4/20/2017 12:45 29 0.001645
... ... ...
8/17/2018 7:44 1 0.999966
12/26/2017 14:16 1 0.999975
7/6/2018 19:14 1 0.999983
10/15/2017 18:32 1 0.999992
3/8/2018 20:57 1 1.000000
[88789 rows x 2 columns]
==================================================
count proportion
order_approved_at
1/10/2018 10:32 121 0.001017
12/1/2017 11:31 94 0.001807
7/24/2018 10:31 88 0.002547
11/7/2017 7:30 87 0.003278
2/27/2018 4:31 82 0.003967
... ... ...
8/23/2017 20:15 1 0.999966
7/3/2017 16:30 1 0.999975
7/28/2018 0:25 1 0.999983
10/7/2016 23:13 1 0.999992
9/14/2017 12:30 1 1.000000
[50462 rows x 2 columns]
==================================================
count proportion
order_delivered_timestamp
8/14/2017 12:46 63 0.000544
10/18/2017 22:35 38 0.000873
3/5/2018 15:22 30 0.001132
6/22/2017 16:04 27 0.001365
2/28/2018 20:09 26 0.001590
... ... ...
1/31/2018 19:16 1 0.999965
11/8/2017 13:42 1 0.999974
6/5/2018 0:44 1 0.999983
7/27/2018 19:56 1 0.999991
3/16/2018 13:08 1 1.000000
[75649 rows x 2 columns]
==================================================
count proportion
order_estimated_delivery_date
12/20/2017 0:00 658 0.005522
3/12/2018 0:00 617 0.010700
3/13/2018 0:00 616 0.015869
5/29/2018 0:00 615 0.021031
5/30/2018 0:00 593 0.026007
... ... ...
11/14/2016 0:00 1 0.999966
11/7/2016 0:00 1 0.999975
1/9/2017 0:00 1 0.999983
10/28/2016 0:00 1 0.999992
10/27/2016 0:00 1 1.000000
[459 rows x 2 columns]
==================================================
count proportion
customer_zip_code_prefix
24220 163 0.001368
22790 156 0.002677
22793 155 0.003978
24230 139 0.005144
22775 132 0.006252
... ... ...
58086 1 0.999966
68798 1 0.999975
55365 1 0.999983
89086 1 0.999992
45920 1 1.000000
[14994 rows x 2 columns]
==================================================
count proportion
customer_city
sao paulo 18864 0.158308
rio de janeiro 8322 0.228147
belo horizonte 3310 0.255925
brasilia 2488 0.276804
curitiba 1827 0.292137
... ... ...
igrapiuna 1 0.999966
nantes 1 0.999975
carnauba dos dantas 1 0.999983
satiro dias 1 0.999992
nova vicosa 1 1.000000
[4119 rows x 2 columns]
==================================================
count proportion
customer_state
SP 50305 0.422163
RJ 15527 0.552467
MG 13826 0.668496
RS 6542 0.723397
PR 6038 0.774068
SC 4355 0.810616
BA 4091 0.844948
DF 2505 0.865970
GO 2461 0.886623
ES 2366 0.906479
PE 1905 0.922466
CE 1567 0.935616
MT 1131 0.945107
PA 1124 0.954540
MA 859 0.961749
MS 850 0.968882
PB 645 0.974295
PI 577 0.979137
RN 574 0.983954
AL 465 0.987857
SE 409 0.991289
TO 340 0.994142
RO 293 0.996601
AM 174 0.998061
AC 95 0.998859
AP 84 0.999564
RR 52 1.000000
==================================================
count proportion
payment_type
credit_card 87795 0.736801
wallet 23177 0.931309
voucher 6475 0.985649
debit_card 1707 0.999975
not_defined 3 1.000000
==================================================
count proportion
product_category_name
toys 88791 0.753149
health_beauty 3143 0.779809
bed_bath_table 2752 0.803152
sports_leisure 2404 0.823543
computers_accessories 2275 0.842841
... ... ...
fashion_childrens_clothes 2 0.999966
diapers_and_hygiene 1 0.999975
home_comfort_2 1 0.999983
furniture_mattress_and_upholstery 1 0.999992
security_and_services 1 1.000000
[70 rows x 2 columns]
==================================================
numerical_df.describe(include='all')
| order_item_id | price | shipping_charges | payment_sequential | payment_installments | payment_value | product_weight_g | product_length_cm | product_height_cm | product_width_cm | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 118325.000000 | 118325.000000 | 118325.000000 | 119157.000000 | 119157.000000 | 119157.000000 | 118305.000000 | 118305.000000 | 118305.000000 | 118305.000000 |
| mean | 1.196569 | 120.729200 | 20.037466 | 1.094178 | 2.941053 | 172.825878 | 2113.957364 | 30.254537 | 16.633143 | 23.066692 |
| std | 0.699144 | 184.130919 | 15.847918 | 0.726675 | 2.777111 | 267.813648 | 3789.273735 | 16.189991 | 13.453618 | 11.743808 |
| min | 1.000000 | 0.850000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 7.000000 | 2.000000 | 6.000000 |
| 25% | 1.000000 | 39.900000 | 13.080000 | 1.000000 | 1.000000 | 60.890000 | 300.000000 | 18.000000 | 8.000000 | 15.000000 |
| 50% | 1.000000 | 74.900000 | 16.280000 | 1.000000 | 2.000000 | 108.210000 | 700.000000 | 25.000000 | 13.000000 | 20.000000 |
| 75% | 1.000000 | 134.900000 | 21.190000 | 1.000000 | 4.000000 | 189.260000 | 1800.000000 | 38.000000 | 20.000000 | 30.000000 |
| max | 21.000000 | 6735.000000 | 409.680000 | 29.000000 | 24.000000 | 13664.080000 | 40425.000000 | 105.000000 | 105.000000 | 118.000000 |
2-2. Data visualization¶
Numerical visualization methods:
- Histogram: histplot()
- Kernel density estimate: kdeplot()
- Distribution plot: displot()
- Box plot: boxplot()
- Pair plot: pairplot()
Categorical visualization methods:
- Bar plot: barplot()
- Count plot: countplot()
- Box plot: boxplot()
Correlation visualization methods (to be used after preprocessing):
- Heatmap: heatmap()
- Line plot: lineplot()
- Scatter plot: scatterplot()
- Scatter plot with regression line: regplot()
2-2-1. Visualizing numerical variables¶
# Histograms of the numerical variables
plt.figure(figsize=(20, 10))
for idx, col in enumerate(numerical_cols):
    plt.subplot(3, 4, idx+1)
    ax = sns.histplot(merge_df[col], kde=False)
    plt.title(f'Histogram of {col}')
plt.tight_layout()
plt.show()
# KDE plots of the numerical variables
plt.figure(figsize=(20, 10))
for idx, col in enumerate(numerical_cols):
    plt.subplot(3, 4, idx+1)
    ax = sns.kdeplot(merge_df[col], fill=True)  # `shade` is deprecated in seaborn >= 0.11; use `fill`
    plt.title(f'KDE Plot of {col}')
plt.tight_layout()
plt.show()
# Box plots of the numerical variables
plt.figure(figsize=(20, 10))
for idx, col in enumerate(numerical_cols):
    plt.subplot(3, 4, idx+1)
    sns.boxplot(y=merge_df[col])
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col)
    plt.ylabel('Value')
plt.tight_layout()
plt.show()
# Pair plot of the numerical variables
sns.pairplot(merge_df[numerical_cols], diag_kind='kde')
plt.suptitle('Pairplot of Numerical Variables', y=1.02)
plt.show()
Interpretation of the numerical visualizations:
- Every column is strongly right-skewed, with most of the mass piled up at the low end.
- The data will likely need scaling before modeling.
- The box plots show many outliers, so the data will need cleaning.
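Given the right skew noted above, a log transform before min-max scaling is one common option; a hedged sketch on synthetic skewed values (not the project data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic right-skewed values standing in for price / payment_value
values = rng.lognormal(mean=4.0, sigma=1.0, size=1000)

# log1p compresses the long right tail before scaling to [0, 1]
logged = np.log1p(values).reshape(-1, 1)
scaled = MinMaxScaler().fit_transform(logged)

print(round(float(scaled.min()), 3), round(float(scaled.max()), 3))
```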
2-2-2. Visualizing categorical variables¶
# Count plots of the categorical variables, showing only the top 10 values
rows = 5
cols = 2
# Set the figure size
fig, axes = plt.subplots(rows, cols, figsize=(15, 25))
# Flatten the axes array so each subplot can be indexed directly
axes = axes.flatten()
# Draw a countplot of the top 10 values for each categorical variable
for i, col in enumerate(categorical_cols):
    top_10_values = merge_df[col].value_counts().nlargest(10).index
    sns.countplot(data=merge_df[merge_df[col].isin(top_10_values)], x=col, ax=axes[i])
    axes[i].set_title(f'Top 10 Values in {col}')
    axes[i].set_ylabel('Count')
    axes[i].set_xlabel(col)
    axes[i].tick_params(axis='x', rotation=45)
# Adjust the layout
plt.tight_layout()
plt.show()
3. Data Preprocessing¶
3-1. Handling missing values¶
# Columns containing missing values
merge_df[missing_columns].isna().sum()
order_approved_at              178
order_delivered_timestamp     3422
order_item_id                  835
product_id                     835
seller_id                      835
price                          835
shipping_charges               835
payment_sequential               3
payment_type                     3
payment_installments             3
payment_value                    3
product_category_name         1267
product_weight_g               855
product_length_cm              855
product_height_cm              855
product_width_cm               855
dtype: int64
# Descriptive statistics for missing_col_df
missing_col_df = merge_df[missing_columns]
missing_col_df.describe(include='all')
| order_approved_at | order_delivered_timestamp | order_item_id | product_id | seller_id | price | shipping_charges | payment_sequential | payment_type | payment_installments | payment_value | product_category_name | product_weight_g | product_length_cm | product_height_cm | product_width_cm | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 118982 | 115738 | 118325.000000 | 118325 | 118325 | 118325.000000 | 118325.000000 | 119157.000000 | 119157 | 119157.000000 | 119157.000000 | 117893 | 118305.000000 | 118305.000000 | 118305.000000 | 118305.000000 |
| unique | 50462 | 75649 | NaN | 32951 | 3095 | NaN | NaN | NaN | 5 | NaN | NaN | 70 | NaN | NaN | NaN | NaN |
| top | 1/10/2018 10:32 | 8/14/2017 12:46 | NaN | aca2eb7d00ea1a7b8ebd4e68314663af | 4a3ca9315b744ce9f8e9374361493884 | NaN | NaN | NaN | credit_card | NaN | NaN | toys | NaN | NaN | NaN | NaN |
| freq | 121 | 63 | NaN | 538 | 2155 | NaN | NaN | NaN | 87795 | NaN | NaN | 88791 | NaN | NaN | NaN | NaN |
| mean | NaN | NaN | 1.196569 | NaN | NaN | 120.729200 | 20.037466 | 1.094178 | NaN | 2.941053 | 172.825878 | NaN | 2113.957364 | 30.254537 | 16.633143 | 23.066692 |
| std | NaN | NaN | 0.699144 | NaN | NaN | 184.130919 | 15.847918 | 0.726675 | NaN | 2.777111 | 267.813648 | NaN | 3789.273735 | 16.189991 | 13.453618 | 11.743808 |
| min | NaN | NaN | 1.000000 | NaN | NaN | 0.850000 | 0.000000 | 1.000000 | NaN | 0.000000 | 0.000000 | NaN | 0.000000 | 7.000000 | 2.000000 | 6.000000 |
| 25% | NaN | NaN | 1.000000 | NaN | NaN | 39.900000 | 13.080000 | 1.000000 | NaN | 1.000000 | 60.890000 | NaN | 300.000000 | 18.000000 | 8.000000 | 15.000000 |
| 50% | NaN | NaN | 1.000000 | NaN | NaN | 74.900000 | 16.280000 | 1.000000 | NaN | 2.000000 | 108.210000 | NaN | 700.000000 | 25.000000 | 13.000000 | 20.000000 |
| 75% | NaN | NaN | 1.000000 | NaN | NaN | 134.900000 | 21.190000 | 1.000000 | NaN | 4.000000 | 189.260000 | NaN | 1800.000000 | 38.000000 | 20.000000 | 30.000000 |
| max | NaN | NaN | 21.000000 | NaN | NaN | 6735.000000 | 409.680000 | 29.000000 | NaN | 24.000000 | 13664.080000 | NaN | 40425.000000 | 105.000000 | 105.000000 | 118.000000 |
# value_counts for missing_columns
for col in missing_col_df.columns:
    print(missing_col_df[col].value_counts().to_frame().join(missing_col_df[col].value_counts(normalize=True).to_frame().cumsum()))
    print("=" * 50)
count proportion
order_approved_at
1/10/2018 10:32 121 0.001017
12/1/2017 11:31 94 0.001807
7/24/2018 10:31 88 0.002547
11/7/2017 7:30 87 0.003278
2/27/2018 4:31 82 0.003967
... ... ...
8/23/2017 20:15 1 0.999966
7/3/2017 16:30 1 0.999975
7/28/2018 0:25 1 0.999983
10/7/2016 23:13 1 0.999992
9/14/2017 12:30 1 1.000000
[50462 rows x 2 columns]
==================================================
count proportion
order_delivered_timestamp
8/14/2017 12:46 63 0.000544
10/18/2017 22:35 38 0.000873
3/5/2018 15:22 30 0.001132
6/22/2017 16:04 27 0.001365
2/28/2018 20:09 26 0.001590
... ... ...
1/31/2018 19:16 1 0.999965
11/8/2017 13:42 1 0.999974
6/5/2018 0:44 1 0.999983
7/27/2018 19:56 1 0.999991
3/16/2018 13:08 1 1.000000
[75649 rows x 2 columns]
==================================================
count proportion
order_item_id
1.0 103656 0.876028
2.0 10314 0.963195
3.0 2403 0.983503
4.0 995 0.991912
5.0 473 0.995910
6.0 264 0.998141
7.0 62 0.998665
8.0 37 0.998977
9.0 29 0.999222
10.0 26 0.999442
11.0 17 0.999586
12.0 13 0.999696
13.0 8 0.999763
14.0 7 0.999823
15.0 5 0.999865
16.0 3 0.999890
17.0 3 0.999915
18.0 3 0.999941
19.0 3 0.999966
20.0 3 0.999992
21.0 1 1.000000
==================================================
count proportion
product_id
aca2eb7d00ea1a7b8ebd4e68314663af 538 0.004547
99a4788cb24856965c36a24e339b6058 527 0.009001
422879e10f46682990de24d770e7f83d 506 0.013277
389d119b48cf3043d311335e499d9c6b 407 0.016717
368c6c730842d78016ad823897a372db 400 0.020097
... ... ...
66ba269ae7ed286b181ebfb76d1d989e 1 0.999966
f4a2e2f1cf569d9cee134511bdd0f969 1 0.999975
5f79ee3b64a3af922735487fb06da429 1 0.999983
6d3b335d1fba472b61c01ba542307f62 1 0.999992
006619bbed68b000c8ba3f8725d5409e 1 1.000000
[32951 rows x 2 columns]
==================================================
count proportion
seller_id
4a3ca9315b744ce9f8e9374361493884 2155 0.018213
6560211a19b47992c3666cc44a7e94c0 2132 0.036231
1f50f920176fa81dab994f9023523100 2019 0.053294
cc419e0650a3c5ba77189a1882b7556a 1857 0.068988
da8622b14eb17ae2831f4ac5b9dab84a 1646 0.082899
... ... ...
c74f14c1e26cf1bd5bd34b6a1e274426 1 0.999966
4ce6e5f6c52515177e18c1c9361d8677 1 0.999975
f90f77ef2799a27f80d90c425ca944f7 1 0.999983
c3e2398fcc7e581cda2e546557bf6968 1 0.999992
f3862c2188522d89860c38a3ea8b550d 1 1.000000
[3095 rows x 2 columns]
==================================================
count proportion
price
59.90 2628 0.022210
69.90 2116 0.040093
49.90 2049 0.057410
89.90 1632 0.071202
99.90 1531 0.084141
... ... ...
424.90 1 0.999966
234.80 1 0.999975
119.95 1 0.999983
107.94 1 0.999992
213.39 1 1.000000
[5968 rows x 2 columns]
==================================================
count proportion
shipping_charges
15.10 3856 0.032588
7.78 2353 0.052474
14.10 1989 0.069284
11.85 1982 0.086034
18.23 1631 0.099818
... ... ...
39.37 1 0.999966
49.03 1 0.999975
40.72 1 0.999983
48.10 1 0.999992
36.89 1 1.000000
[6999 rows x 2 columns]
==================================================
count proportion
payment_sequential
1.0 114011 0.956813
2.0 3430 0.985599
3.0 662 0.991155
4.0 320 0.993840
5.0 192 0.995451
6.0 134 0.996576
7.0 92 0.997348
8.0 61 0.997860
9.0 50 0.998280
10.0 41 0.998624
11.0 35 0.998917
12.0 27 0.999144
13.0 16 0.999278
14.0 13 0.999387
15.0 11 0.999480
16.0 9 0.999555
18.0 9 0.999631
17.0 9 0.999706
19.0 9 0.999782
21.0 6 0.999832
20.0 6 0.999883
22.0 3 0.999908
25.0 2 0.999924
26.0 2 0.999941
23.0 2 0.999958
24.0 2 0.999975
27.0 1 0.999983
29.0 1 0.999992
28.0 1 1.000000
==================================================
count proportion
payment_type
credit_card 87795 0.736801
wallet 23177 0.931309
voucher 6475 0.985649
debit_card 1707 0.999975
not_defined 3 1.000000
==================================================
count proportion
payment_installments
1.0 59438 0.498821
2.0 13865 0.615180
3.0 11883 0.714906
4.0 8073 0.782656
10.0 6975 0.841193
5.0 6102 0.892402
8.0 5109 0.935279
6.0 4670 0.974471
7.0 1857 0.990055
9.0 747 0.996324
12.0 166 0.997717
15.0 94 0.998506
18.0 38 0.998825
24.0 34 0.999110
11.0 26 0.999329
20.0 21 0.999505
13.0 18 0.999656
14.0 16 0.999790
17.0 8 0.999857
16.0 7 0.999916
21.0 5 0.999958
0.0 3 0.999983
23.0 1 0.999992
22.0 1 1.000000
==================================================
count proportion
payment_value
50.00 351 0.002946
100.00 302 0.005480
20.00 289 0.007906
77.57 251 0.010012
35.00 167 0.011414
... ... ...
202.55 1 0.999966
268.97 1 0.999975
381.76 1 0.999983
202.53 1 0.999992
182.39 1 1.000000
[29077 rows x 2 columns]
==================================================
count proportion
product_category_name
toys 88791 0.753149
health_beauty 3143 0.779809
bed_bath_table 2752 0.803152
sports_leisure 2404 0.823543
computers_accessories 2275 0.842841
... ... ...
fashion_childrens_clothes 2 0.999966
diapers_and_hygiene 1 0.999975
home_comfort_2 1 0.999983
furniture_mattress_and_upholstery 1 0.999992
security_and_services 1 1.000000
[70 rows x 2 columns]
==================================================
count proportion
product_weight_g
200.0 7092 0.059947
150.0 5414 0.105710
250.0 4727 0.145666
300.0 4444 0.183230
400.0 3780 0.215181
... ... ...
726.0 1 0.999966
8575.0 1 0.999975
3598.0 1 0.999983
11025.0 1 0.999992
2676.0 1 1.000000
[2204 rows x 2 columns]
==================================================
count proportion
product_length_cm
16.0 18420 0.155699
20.0 10976 0.248476
30.0 7954 0.315709
17.0 6208 0.368184
18.0 5902 0.418072
... ... ...
83.0 8 0.999831
96.0 8 0.999899
94.0 6 0.999949
9.0 4 0.999983
8.0 2 1.000000
[99 rows x 2 columns]
==================================================
count proportion
product_height_cm
10.0 10375 0.087697
20.0 6935 0.146317
15.0 6880 0.204471
12.0 6533 0.259693
11.0 6427 0.314019
... ... ...
98.0 3 0.999932
92.0 3 0.999958
94.0 2 0.999975
97.0 2 0.999992
89.0 1 1.000000
[102 rows x 2 columns]
==================================================
count proportion
product_width_cm
20.0 12727 0.107578
11.0 11140 0.201741
15.0 9373 0.280969
16.0 8859 0.355851
30.0 8046 0.423862
... ... ...
103.0 1 0.999966
97.0 1 0.999975
104.0 1 0.999983
98.0 1 0.999992
86.0 1 1.000000
[95 rows x 2 columns]
==================================================
order_approved_at¶
# Extract only the rows where 'order_approved_at' is missing
missing_rows = merge_df[merge_df['order_approved_at'].isna()]
# value_counts of the 'order_status' column
order_status_counts = missing_rows['order_status'].value_counts()
print("Order Status Counts:")
print(order_status_counts)
Order Status Counts:
order_status
canceled     158
delivered     15
created        5
Name: count, dtype: int64
- For canceled orders the missing value is expected: an order that was canceled was naturally never approved.
- There are 15 orders that were delivered but never approved; it seems most reasonable to treat these as system errors.
- The created cases likewise look like system errors.
Conclusion: order_approved_at has 178 missing values. That is such a small share of the full dataset that tracking down the cause would cost more than knowing it would gain, so how about simply dropping these rows?
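The proposed drop can be sketched as follows, on a toy stand-in for merge_df (the column names follow the schema above; the values here are illustrative only):

```python
import pandas as pd
import numpy as np

# Toy stand-in for merge_df; illustrative values only.
df = pd.DataFrame({
    'order_status': ['delivered', 'canceled', 'created', 'delivered'],
    'order_approved_at': ['2017-11-16 15:26', np.nan, np.nan, '2018-01-31 14:23'],
})

# Share of rows that would be lost (178 out of ~119k in the real data)
missing_ratio = df['order_approved_at'].isna().mean()

# Drop the rows with a missing approval timestamp
df_clean = df.dropna(subset=['order_approved_at'])
```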
order_delivered_timestamp¶
# Extract only the rows where 'order_delivered_timestamp' is missing
missing_rows = merge_df[merge_df['order_delivered_timestamp'].isna()]
# value_counts of the 'order_status' column
order_status_counts = missing_rows['order_status'].value_counts()
print("Order Status Counts:")
print(order_status_counts)
Order Status Counts:
order_status
shipped        1257
canceled        745
unavailable     653
processing      376
invoiced        375
delivered         8
created           5
approved          3
Name: count, dtype: int64
When order_status is anything other than delivered, the delivery timestamp should of course be empty. The real problem is the rows marked delivered whose timestamp is still empty.
- There are only 8 of those, so drop the rows.
- Interpreting the missing values for the remaining statuses would require more e-commerce domain knowledge.
Conclusion: for every status other than delivered, the order simply has not arrived yet, so an empty timestamp is expected. The 3,422 missing values are roughly 3% of the full dataset, so how about leaving them empty and running the analysis on the remaining rows?
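The idea above, keeping the expected gaps and dropping only the contradictory "delivered but no timestamp" rows, can be sketched on a toy stand-in for merge_df (illustrative values only):

```python
import pandas as pd
import numpy as np

# Toy stand-in for merge_df; illustrative values only.
df = pd.DataFrame({
    'order_status': ['delivered', 'shipped', 'delivered', 'canceled'],
    'order_delivered_timestamp': ['2016-11-09 07:47', np.nan, np.nan, np.nan],
})

# A delivered order with no delivery timestamp is contradictory
contradictory = (df['order_status'] == 'delivered') & df['order_delivered_timestamp'].isna()
df_clean = df[~contradictory]
```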
order_item_id ~ shipping_charges¶
- order_item_id, product_id, seller_id, price, shipping_charges
# Every column in columns_to_check has exactly 835 missing values, suggesting the gaps are related, so examine them together.
# Columns to check
columns_to_check = ['order_item_id', 'product_id', 'seller_id', 'price', 'shipping_charges']
# For each column, compute value_counts of 'order_status' over the rows where that column is missing
for column in columns_to_check:
    # Filter to the rows where this column is missing
    missing_rows = merge_df[merge_df[column].isna()]
    # value_counts of 'order_status' for those rows
    order_status_counts = missing_rows['order_status'].value_counts()
    # Print the result
    print(f"Order Status Value Counts for Rows with Missing '{column}':")
    print(order_status_counts)
    print("-" * 50)  # separator
Order Status Value Counts for Rows with Missing 'order_item_id':
order_status
unavailable    646
canceled       181
created          5
invoiced         2
shipped          1
Name: count, dtype: int64
--------------------------------------------------
Order Status Value Counts for Rows with Missing 'product_id':
order_status
unavailable    646
canceled       181
created          5
invoiced         2
shipped          1
Name: count, dtype: int64
--------------------------------------------------
Order Status Value Counts for Rows with Missing 'seller_id':
order_status
unavailable    646
canceled       181
created          5
invoiced         2
shipped          1
Name: count, dtype: int64
--------------------------------------------------
Order Status Value Counts for Rows with Missing 'price':
order_status
unavailable    646
canceled       181
created          5
invoiced         2
shipped          1
Name: count, dtype: int64
--------------------------------------------------
Order Status Value Counts for Rows with Missing 'shipping_charges':
order_status
unavailable    646
canceled       181
created          5
invoiced         2
shipped          1
Name: count, dtype: int64
--------------------------------------------------
As suspected, every column in columns_to_check is driven by order_status.
- Of the 835 missing rows, 646 belong to unavailable.
- Looking at the unavailable rows below, order_purchase_timestamp and order_approved_at are populated, which suggests the seller canceled the order, for example because of missing stock.
- The canceled rows also have order_purchase_timestamp and order_approved_at filled in, so these are presumably cancellations made by the customer.
Conclusion: given these findings, and since this project analyzes valid orders only, dropping all 835 rows seems like the right call.
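Because the item-level columns are missing together, dropping on that column subset removes exactly the 835 invalid rows. A sketch on a toy stand-in for merge_df (illustrative values only):

```python
import pandas as pd
import numpy as np

# Toy stand-in: 'unavailable'/'canceled' rows carry NaN across all item columns together.
df = pd.DataFrame({
    'order_status': ['delivered', 'unavailable', 'canceled'],
    'order_item_id': [1.0, np.nan, np.nan],
    'price': [49.9, np.nan, np.nan],
    'shipping_charges': [8.34, np.nan, np.nan],
})

# Keep only valid orders: rows with complete item information
df_valid = df.dropna(subset=['order_item_id', 'price', 'shipping_charges'])
```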
merge_df[merge_df['order_status'] == 'unavailable'].head(3)
| order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_timestamp | order_estimated_delivery_date | customer_zip_code_prefix | customer_city | customer_state | ... | shipping_charges | payment_sequential | payment_type | payment_installments | payment_value | product_category_name | product_weight_g | product_length_cm | product_height_cm | product_width_cm | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 320 | 8e24261a7e58791d10cb1bf9da94df5c | 41fc647b8c6bd979b1b6364b60471b50 | unavailable | 11/16/2017 15:09 | 11/16/2017 15:26 | NaN | 12/5/2017 0:00 | 89288 | sao bento do sul | SC | ... | NaN | 1.0 | credit_card | 5.0 | 84.00 | NaN | NaN | NaN | NaN | NaN |
| 701 | c272bcd21c287498b4883c7512019702 | 0e634b16e4c585acbd7b2e8276ce6677 | unavailable | 1/31/2018 11:31 | 1/31/2018 14:23 | NaN | 2/16/2018 0:00 | 11701 | praia grande | SP | ... | NaN | 1.0 | credit_card | 1.0 | 97.68 | NaN | NaN | NaN | NaN | NaN |
| 823 | 37553832a3a89c9b2db59701c357ca67 | 596ed6d7a35890b3fbac54ec01f69685 | unavailable | 8/14/2017 17:38 | 8/17/2017 0:15 | NaN | 9/5/2017 0:00 | 2318 | sao paulo | SP | ... | NaN | 1.0 | wallet | 1.0 | 132.46 | NaN | NaN | NaN | NaN | NaN |
3 rows × 24 columns
merge_df[merge_df['order_status'] == 'canceled'].head(3)
| order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_timestamp | order_estimated_delivery_date | customer_zip_code_prefix | customer_city | customer_state | ... | shipping_charges | payment_sequential | payment_type | payment_installments | payment_value | product_category_name | product_weight_g | product_length_cm | product_height_cm | product_width_cm | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 471 | 1b9ecfe83cdc259250e1a8aca174f0ad | 8ea097b1824dbd4d17af71b0afe04301 | canceled | 8/4/2018 14:29 | 8/7/2018 4:10 | NaN | 8/14/2018 0:00 | 5514 | sao paulo | SP | ... | 8.34 | 1.0 | wallet | 1.0 | 33.34 | toys | 350.0 | 22.0 | 11.0 | 17.0 |
| 737 | 714fb133a6730ab81fa1d3c1b2007291 | c4ebedb09beb89cc0314c5c0c33f8053 | canceled | 1/26/2018 21:34 | 1/26/2018 21:58 | NaN | 2/22/2018 0:00 | 9961 | diadema | SP | ... | 26.11 | 1.0 | credit_card | 4.0 | 96.01 | toys | 12300.0 | 40.0 | 40.0 | 40.0 |
| 1267 | 3a129877493c8189c59c60eb71d97c29 | 11c3d6c93dea6ce86769a4835c171bd9 | canceled | 1/25/2018 13:34 | 1/25/2018 13:50 | NaN | 2/23/2018 0:00 | 87013 | maringa | PR | ... | 15.10 | 1.0 | credit_card | 1.0 | 14.12 | toys | 329.0 | 20.0 | 16.0 | 18.0 |
3 rows × 24 columns
payment_sequential ~ payment_value¶
- payment_sequential, payment_type, payment_installments, payment_value
# The columns in columns_to_check2 each have exactly 3 missing values, suggesting they are related, so examine them together.
columns_to_check2 = ['payment_sequential', 'payment_type', 'payment_installments', 'payment_value']
merge_df[merge_df[columns_to_check2].isna().any(axis=1)]
| order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_timestamp | order_estimated_delivery_date | customer_zip_code_prefix | customer_city | customer_state | ... | shipping_charges | payment_sequential | payment_type | payment_installments | payment_value | product_category_name | product_weight_g | product_length_cm | product_height_cm | product_width_cm | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36884 | bfbd0f9bdef84302105ad712db648a6c | 830d5b7aaa3b6f1e9ad63703bec97d23 | delivered | 9/15/2016 12:16 | 9/15/2016 12:16 | 11/9/2016 7:47 | 10/4/2016 0:00 | 14600 | sao joaquim da barra | SP | ... | 2.83 | NaN | NaN | NaN | NaN | toys | 1000.0 | 16.0 | 16.0 | 16.0 |
| 36885 | bfbd0f9bdef84302105ad712db648a6c | 830d5b7aaa3b6f1e9ad63703bec97d23 | delivered | 9/15/2016 12:16 | 9/15/2016 12:16 | 11/9/2016 7:47 | 10/4/2016 0:00 | 14600 | sao joaquim da barra | SP | ... | 2.83 | NaN | NaN | NaN | NaN | toys | 1000.0 | 16.0 | 16.0 | 16.0 |
| 36886 | bfbd0f9bdef84302105ad712db648a6c | 830d5b7aaa3b6f1e9ad63703bec97d23 | delivered | 9/15/2016 12:16 | 9/15/2016 12:16 | 11/9/2016 7:47 | 10/4/2016 0:00 | 14600 | sao joaquim da barra | SP | ... | 2.83 | NaN | NaN | NaN | NaN | toys | 1000.0 | 16.0 | 16.0 | 16.0 |
3 rows × 24 columns
Inspecting the rows shows they are exact duplicates of one another.
Conclusion: drop the rows.
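Since the three rows are fully identical, one option is drop_duplicates, which treats NaN values as equal when comparing rows. A sketch on toy data (illustrative values only):

```python
import pandas as pd
import numpy as np

# Toy stand-in: three identical rows, as observed for the order above.
df = pd.DataFrame({
    'order_id': ['bfbd0f9b', 'bfbd0f9b', 'bfbd0f9b', 'other'],
    'payment_value': [np.nan, np.nan, np.nan, 96.01],
})

# NaN == NaN for duplicate detection, so the three copies collapse to one
df_dedup = df.drop_duplicates()
```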
product_category_name¶
# Extract the product_id values of rows where product_category_name is missing
product_ids = merge_df[merge_df['product_category_name'].isna()]['product_id'].tolist()
product_ids
['71225f49be70df4297892f6a5fa62171', '9820e04e332fc450d6fd975befc1bc28', nan, nan, nan, '0ae10ca52e6cb8745e17c11e7c43907c', nan, nan, nan, 'c4d92c32b2e02363d3f087c7156a523c', ...]
# Filter to the rows whose product_id is in product_ids but whose product_category_name is NOT missing
filtered_df = merge_df[merge_df['product_id'].isin(product_ids) & merge_df['product_category_name'].notna()]
filtered_df
| order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_timestamp | order_estimated_delivery_date | customer_zip_code_prefix | customer_city | customer_state | ... | shipping_charges | payment_sequential | payment_type | payment_installments | payment_value | product_category_name | product_weight_g | product_length_cm | product_height_cm | product_width_cm |
|---|
0 rows × 24 columns
We extracted the product_id values of the rows missing product_category_name. Might some other row with the same product_id carry a non-missing product_category_name?
- We checked: there were none.
Conclusion: these are probably products that are no longer sold, so dropping the rows seems appropriate.
product_weight_g ~ product_width_cm¶
- product_weight_g, product_length_cm, product_height_cm, product_width_cm
- These columns are missing in only about 0.71% of the full dataset, so replace the missing values with the median
- The median is chosen to limit the influence of extreme values
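Why the median rather than the mean: a single extreme weight (the observed maximum is 40425 g) drags the mean far above typical values, while the median stays representative. A small self-contained check:

```python
import pandas as pd
import numpy as np

# Toy weights including one extreme value, mirroring the observed max of 40425 g
s = pd.Series([200.0, 250.0, np.nan, 300.0, 40425.0])

mean_fill = s.mean()      # inflated by the outlier
median_fill = s.median()  # robust to it
filled = s.fillna(median_fill)
```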
Final missing-value handling code¶
# Fill missing values with each column's median [product_weight_g, product_length_cm, product_height_cm, product_width_cm]
# (assign back rather than using inplace=True, which raises a FutureWarning on chained fillna in recent pandas)
merge_df['product_weight_g'] = merge_df['product_weight_g'].fillna(merge_df['product_weight_g'].median())
merge_df['product_length_cm'] = merge_df['product_length_cm'].fillna(merge_df['product_length_cm'].median())
merge_df['product_height_cm'] = merge_df['product_height_cm'].fillna(merge_df['product_height_cm'].median())
merge_df['product_width_cm'] = merge_df['product_width_cm'].fillna(merge_df['product_width_cm'].median())
# Columns in which missing values trigger row removal
columns_to_drop_null= [
'order_approved_at',
'order_item_id',
'product_id',
'seller_id',
'price',
'shipping_charges',
'payment_sequential',
'payment_type',
'payment_installments',
'payment_value',
'product_category_name'
]
# Drop rows that have missing values in the specified columns
merge_df = merge_df.dropna(subset=columns_to_drop_null)
# Drop columns judged unnecessary for this analysis project
columns_to_drop = ['payment_sequential']
merge_df = merge_df.drop(columns=columns_to_drop)
merge_df.isna().sum()
order_id                             0
customer_id                          0
order_status                         0
order_purchase_timestamp             0
order_approved_at                    0
order_delivered_timestamp         2566
order_estimated_delivery_date        0
customer_zip_code_prefix             0
customer_city                        0
customer_state                       0
order_item_id                        0
product_id                           0
seller_id                            0
price                                0
shipping_charges                     0
payment_type                         0
payment_installments                 0
payment_value                        0
product_category_name                0
product_weight_g                     0
product_length_cm                    0
product_height_cm                    0
product_width_cm                     0
dtype: int64
Missing values are now handled for every column except order_delivered_timestamp.
3-2. Outlier handling¶
As seen above, every numerical variable was found to contain outliers, so outlier handling is required.
numerical_df.describe(include='all')
| order_item_id | price | shipping_charges | payment_sequential | payment_installments | payment_value | product_weight_g | product_length_cm | product_height_cm | product_width_cm | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 118325.000000 | 118325.000000 | 118325.000000 | 119157.000000 | 119157.000000 | 119157.000000 | 118305.000000 | 118305.000000 | 118305.000000 | 118305.000000 |
| mean | 1.196569 | 120.729200 | 20.037466 | 1.094178 | 2.941053 | 172.825878 | 2113.957364 | 30.254537 | 16.633143 | 23.066692 |
| std | 0.699144 | 184.130919 | 15.847918 | 0.726675 | 2.777111 | 267.813648 | 3789.273735 | 16.189991 | 13.453618 | 11.743808 |
| min | 1.000000 | 0.850000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 7.000000 | 2.000000 | 6.000000 |
| 25% | 1.000000 | 39.900000 | 13.080000 | 1.000000 | 1.000000 | 60.890000 | 300.000000 | 18.000000 | 8.000000 | 15.000000 |
| 50% | 1.000000 | 74.900000 | 16.280000 | 1.000000 | 2.000000 | 108.210000 | 700.000000 | 25.000000 | 13.000000 | 20.000000 |
| 75% | 1.000000 | 134.900000 | 21.190000 | 1.000000 | 4.000000 | 189.260000 | 1800.000000 | 38.000000 | 20.000000 | 30.000000 |
| max | 21.000000 | 6735.000000 | 409.680000 | 29.000000 | 24.000000 | 13664.080000 | 40425.000000 | 105.000000 | 105.000000 | 118.000000 |
# value_counts for the numerical variables
numerical_cols = ['order_item_id', 'price', 'shipping_charges', 'payment_installments', 'payment_value',
'product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm']
numerical_df = merge_df[numerical_cols]
for col in numerical_df.columns:
    print(numerical_df[col].value_counts().to_frame().join(numerical_df[col].value_counts(normalize=True).to_frame().cumsum()))
    print("=" * 50)
count proportion
order_item_id
1.0 103276 0.876148
2.0 10268 0.963258
3.0 2388 0.983516
4.0 991 0.991924
5.0 471 0.995919
6.0 262 0.998142
7.0 62 0.998668
8.0 36 0.998973
9.0 29 0.999220
10.0 26 0.999440
11.0 17 0.999584
12.0 13 0.999695
13.0 8 0.999762
14.0 7 0.999822
15.0 5 0.999864
16.0 3 0.999890
17.0 3 0.999915
18.0 3 0.999941
19.0 3 0.999966
20.0 3 0.999992
21.0 1 1.000000
==================================================
count proportion
price
59.90 2623 0.022252
69.90 2115 0.040195
49.90 2035 0.057459
89.90 1628 0.071270
99.90 1522 0.084182
... ... ...
21.94 1 0.999966
869.99 1 0.999975
1019.49 1 0.999983
698.00 1 0.999992
213.39 1 1.000000
[5944 rows x 2 columns]
==================================================
count proportion
shipping_charges
15.10 3838 0.032560
7.78 2340 0.052411
14.10 1976 0.069175
11.85 1966 0.085854
18.23 1627 0.099656
... ... ...
53.02 1 0.999966
55.84 1 0.999975
39.37 1 0.999983
49.03 1 0.999992
0.62 1 1.000000
[6992 rows x 2 columns]
==================================================
count proportion
payment_installments
1.0 58681 0.497824
2.0 13750 0.614473
3.0 11778 0.714392
4.0 8000 0.782261
10.0 6906 0.840848
5.0 6040 0.892089
8.0 5078 0.935169
6.0 4631 0.974456
7.0 1841 0.990074
9.0 736 0.996318
12.0 164 0.997709
15.0 94 0.998507
18.0 38 0.998829
24.0 34 0.999118
11.0 25 0.999330
20.0 21 0.999508
13.0 18 0.999661
14.0 16 0.999796
17.0 7 0.999856
16.0 7 0.999915
21.0 5 0.999958
0.0 3 0.999983
23.0 1 0.999992
22.0 1 1.000000
==================================================
count proportion
payment_value
50.00 347 0.002944
100.00 300 0.005489
20.00 288 0.007932
77.57 251 0.010062
35.00 164 0.011453
... ... ...
526.67 1 0.999966
680.82 1 0.999975
438.81 1 0.999983
561.56 1 0.999992
281.43 1 1.000000
[28878 rows x 2 columns]
==================================================
count proportion
product_weight_g
200.0 7023 0.059580
150.0 5407 0.105451
250.0 4719 0.145485
300.0 4417 0.182957
400.0 3733 0.214626
... ... ...
7034.0 1 0.999966
361.0 1 0.999975
9167.0 1 0.999983
13805.0 1 0.999992
2676.0 1 1.000000
[2203 rows x 2 columns]
==================================================
count proportion
product_length_cm
16.0 18305 0.155292
20.0 10965 0.248314
30.0 7934 0.315622
17.0 6198 0.368204
18.0 5888 0.418155
... ... ...
83.0 8 0.999830
96.0 8 0.999898
94.0 6 0.999949
9.0 4 0.999983
8.0 2 1.000000
[99 rows x 2 columns]
==================================================
count proportion
product_height_cm
10.0 10358 0.087873
20.0 6905 0.146452
15.0 6858 0.204632
12.0 6526 0.259996
11.0 6422 0.314477
... ... ...
98.0 3 0.999932
92.0 3 0.999958
94.0 2 0.999975
97.0 2 0.999992
89.0 1 1.000000
[102 rows x 2 columns]
==================================================
count proportion
product_width_cm
20.0 12715 0.107869
11.0 11019 0.201349
15.0 9329 0.280492
16.0 8850 0.355572
30.0 8023 0.423635
... ... ...
103.0 1 0.999966
97.0 1 0.999975
104.0 1 0.999983
98.0 1 0.999992
86.0 1 1.000000
[95 rows x 2 columns]
==================================================
order_item_id¶
Looking at the value_counts of order_item_id:
- The value 1 occurs 103,276 times, about 87.6% of the cumulative total.
- The value 2 occurs 10,268 times, about 10% of the whole.
Conclusion: it therefore seems better to convert this into a binary categorical variable separating single-item orders from orders with two or more items.
- Single-item orders are mapped to 1, orders with two or more items to 0.
- Binarizing should also mitigate the imbalance problem to some extent.
def binarize(df, column_name):
    # 1 if exactly one item was ordered, 0 otherwise
    df[column_name] = df[column_name].apply(lambda x: 1 if x == 1 else 0)
    return df
price¶
price has a right-skewed distribution with a long right tail, which can hurt modeling.
Conclusion: remove outliers with the IQR method; it is effective here because extreme values in a right-skewed distribution can heavily distort clustering results.
def remove_outliers_iqr(df, column_name):
    Q1 = df[column_name].quantile(0.25)
    Q3 = df[column_name].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df_cleaned = df[(df[column_name] >= lower_bound) & (df[column_name] <= upper_bound)]
    return df_cleaned
payment_sequential¶
Conclusion: for payment_sequential, about 95% of the cumulative mass sits at the value 1. Low-frequency values could distort the data, so rare payment-sequence values are removed with frequency-based filtering.
def remove_low_frequency_outliers(df, column_name, min_frequency=0.01):
    value_counts = df[column_name].value_counts()
    proportions = value_counts / len(df)
    valid_values = proportions[proportions >= min_frequency].index
    df_cleaned = df[df[column_name].isin(valid_values)]
    return df_cleaned
payment_installments¶
Conclusion: payment_installments runs almost in order from 1 to 12 months, covering roughly 99.8% of the cumulative total.
- Installment counts above 12 months are therefore treated as outliers and filtered out.
def remove_outliers_by_threshold(df, column_name, threshold):
    df_cleaned = df[df[column_name] <= threshold]
    return df_cleaned
payment_value¶
Conclusion: payment_value is related to price, so its outliers are removed with the same IQR method.
shipping_charges¶
Conclusion: handle with the IQR method.
product_weight_g, product_length_cm, product_height_cm, product_width_cm¶
Conclusion: the boxplots examined above suggest the IQR method is appropriate.
Final outlier-handling code¶
# Binarization helper
def binarize(df, column_name):
    df[column_name] = df[column_name].apply(lambda x: 1 if x == 1 else 0)
    return df
# IQR-based outlier removal
def remove_outliers_iqr(df, column_name):
    Q1 = df[column_name].quantile(0.25)
    Q3 = df[column_name].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df_cleaned = df[(df[column_name] >= lower_bound) & (df[column_name] <= upper_bound)]
    return df_cleaned
# Frequency-based filtering
def remove_low_frequency_outliers(df, column_name, min_frequency=0.01):
    value_counts = df[column_name].value_counts()
    proportions = value_counts / len(df)
    valid_values = proportions[proportions >= min_frequency].index
    df_cleaned = df[df[column_name].isin(valid_values)]
    return df_cleaned
# Threshold-based filtering
def remove_outliers_by_threshold(df, column_name, threshold):
    df_cleaned = df[df[column_name] <= threshold]
    return df_cleaned
# Binarize order_item_id: single item → 1, two or more → 0
merged_df_cleaned = binarize(merge_df, 'order_item_id')
# Remove outliers in price, shipping_charges, payment_value with the IQR method
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'price')
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'shipping_charges')
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'payment_value')
# Filter out payment_installments above 12 months
merged_df_cleaned = remove_outliers_by_threshold(merged_df_cleaned, 'payment_installments', 12)
# Apply the IQR method to the product dimension columns
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'product_weight_g')
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'product_length_cm')
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'product_height_cm')
merged_df_cleaned = remove_outliers_iqr(merged_df_cleaned, 'product_width_cm')
# Save the data with outliers and missing values handled
merged_df_cleaned.to_csv('merged_df_cleaned.csv', index=False)
merged_df_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
Index: 77871 entries, 0 to 119159
Data columns (total 23 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   order_id                       77871 non-null  object
 1   customer_id                    77871 non-null  object
 2   order_status                   77871 non-null  object
 3   order_purchase_timestamp       77871 non-null  object
 4   order_approved_at              77871 non-null  object
 5   order_delivered_timestamp      76297 non-null  object
 6   order_estimated_delivery_date  77871 non-null  object
 7   customer_zip_code_prefix       77871 non-null  int64
 8   customer_city                  77871 non-null  object
 9   customer_state                 77871 non-null  object
 10  order_item_id                  77871 non-null  int64
 11  product_id                     77871 non-null  object
 12  seller_id                      77871 non-null  object
 13  price                          77871 non-null  float64
 14  shipping_charges               77871 non-null  float64
 15  payment_type                   77871 non-null  object
 16  payment_installments           77871 non-null  float64
 17  payment_value                  77871 non-null  float64
 18  product_category_name          77871 non-null  object
 19  product_weight_g               77871 non-null  float64
 20  product_length_cm              77871 non-null  float64
 21  product_height_cm              77871 non-null  float64
 22  product_width_cm               77871 non-null  float64
dtypes: float64(8), int64(2), object(13)
memory usage: 14.3+ MB
plt.figure(figsize=(20, 10))
for idx, col in enumerate(numerical_cols):
    plt.subplot(3, 4, idx+1)
    sns.boxplot(y=merged_df_cleaned[col])
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col)
    plt.ylabel('Value')
plt.tight_layout()
plt.show()
3-3. Variable scaling¶
3-3-1. Scaling numerical variables¶
The numerical variables currently have right-skewed distributions with long right tails. A log transform makes them closer to a normal distribution and reduces the asymmetry.
Normalization then maps the data into a fixed range, which improves model training speed and performance.
- Distance-based algorithms such as KNN can run into problems when variable ranges differ; normalization resolves this.
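As a minimal sketch of this two-step scaling (log transform, then Min-Max normalization), assuming only NumPy and scikit-learn and using a synthetic stand-in for a skewed column such as price:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical right-skewed sample standing in for a price-like column
rng = np.random.default_rng(42)
skewed = rng.lognormal(mean=3.0, sigma=1.0, size=1000)

# Step 1: log1p pulls in the long right tail
logged = np.log1p(skewed)

# Step 2: Min-Max normalization maps the logged values into [0, 1]
scaler = MinMaxScaler()
scaled = scaler.fit_transform(logged.reshape(-1, 1)).ravel()

print(scaled.min(), scaled.max())  # 0.0 and 1.0 by construction
```

np.log1p is used rather than np.log so that zero values remain defined; the mean–median gap of the logged column is far smaller than that of the raw column, which is what "closer to normal" means here.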
'product_length_cm', 'product_height_cm', 'product_width_cm'¶
Multiply the three columns into a volume column and drop the originals.
# Create 'volume' as the product of 'product_length_cm', 'product_height_cm', 'product_width_cm'
merged_df_cleaned['volume'] = merged_df_cleaned['product_length_cm'] * merged_df_cleaned['product_height_cm'] * merged_df_cleaned['product_width_cm']
# Drop the three original columns
merged_df_cleaned = merged_df_cleaned.drop(columns=['product_length_cm', 'product_height_cm', 'product_width_cm'])
merged_df_cleaned.head(10)
| order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_timestamp | order_estimated_delivery_date | customer_zip_code_prefix | customer_city | customer_state | ... | product_id | seller_id | price | shipping_charges | payment_type | payment_installments | payment_value | product_category_name | product_weight_g | volume | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 7c396fd4830fd04220f754e42b4e5bff | delivered | 10/2/2017 10:56 | 10/2/2017 11:07 | 10/10/2017 21:25 | 10/18/2017 0:00 | 3149 | sao paulo | SP | ... | 87285b34884572647811a353c7ac498a | 3504c0cb71d7fa48d967e0e4c94d59d9 | 29.99 | 8.72 | credit_card | 1.0 | 18.12 | housewares | 500.0 | 1976.0 |
| 1 | e481f51cbdc54678b7cc49136f2d6af7 | 7c396fd4830fd04220f754e42b4e5bff | delivered | 10/2/2017 10:56 | 10/2/2017 11:07 | 10/10/2017 21:25 | 10/18/2017 0:00 | 3149 | sao paulo | SP | ... | 87285b34884572647811a353c7ac498a | 3504c0cb71d7fa48d967e0e4c94d59d9 | 29.99 | 8.72 | voucher | 1.0 | 2.00 | housewares | 500.0 | 1976.0 |
| 2 | e481f51cbdc54678b7cc49136f2d6af7 | 7c396fd4830fd04220f754e42b4e5bff | delivered | 10/2/2017 10:56 | 10/2/2017 11:07 | 10/10/2017 21:25 | 10/18/2017 0:00 | 3149 | sao paulo | SP | ... | 87285b34884572647811a353c7ac498a | 3504c0cb71d7fa48d967e0e4c94d59d9 | 29.99 | 8.72 | voucher | 1.0 | 18.59 | housewares | 500.0 | 1976.0 |
| 3 | 53cdb2fc8bc7dce0b6741e2150273451 | af07308b275d755c9edb36a90c618231 | delivered | 7/24/2018 20:41 | 7/26/2018 3:24 | 8/7/2018 15:27 | 8/13/2018 0:00 | 47813 | barreiras | BA | ... | 595fac2a385ac33a80bd5114aec74eb8 | 289cdb325fb7e7f891c38608bf9e0962 | 118.70 | 22.76 | wallet | 1.0 | 141.46 | toys | 400.0 | 4693.0 |
| 4 | 47770eb9100c2d0c44946d9cf07ec65d | 3a653a41f6f9fc3d2a113cf8398680e8 | delivered | 8/8/2018 8:38 | 8/8/2018 8:55 | 8/17/2018 18:06 | 9/4/2018 0:00 | 75265 | vianopolis | GO | ... | aa4383b373c6aca5d8797843e5594415 | 4869f7a5dfa277a7dca6462dcf3b52b2 | 159.90 | 19.22 | credit_card | 3.0 | 179.12 | toys | 420.0 | 9576.0 |
| 5 | 949d5b44dbf5de918fe9c16f97b45f8a | 7c142cf63193a1473d2e66489a9ae977 | delivered | 11/18/2017 19:28 | 11/18/2017 19:45 | 12/2/2017 0:28 | 12/15/2017 0:00 | 59296 | sao goncalo do amarante | RN | ... | d0b61bfb1de832b15ba9d266ca96e5b0 | 66922902710d126a0e7d26b0e3805106 | 45.00 | 27.20 | credit_card | 1.0 | 72.20 | toys | 450.0 | 6000.0 |
| 6 | ad21c59c0840e6cb83a9ceb5573f8159 | 72632f0f9dd73dfee390c9b22eb56dd6 | delivered | 2/13/2018 21:18 | 2/13/2018 22:20 | 2/16/2018 18:17 | 2/26/2018 0:00 | 9195 | santo andre | SP | ... | 65266b2da20d04dbe00c5c2d3bb7859e | 2c9e548be18521d1c43cde1c582c6de8 | 19.90 | 8.72 | credit_card | 1.0 | 28.62 | toys | 250.0 | 11475.0 |
| 9 | 6514b8ad8028c9f2cc2374ded245783f | 932afa1e708222e5821dac9cd5db4cae | delivered | 5/16/2017 13:10 | 5/16/2017 13:22 | 5/26/2017 12:55 | 6/7/2017 0:00 | 26525 | nilopolis | RJ | ... | 4520766ec412348b8d4caa5e8a18c464 | 16090f2ca825584b5a147ab24aa30c86 | 59.99 | 15.17 | credit_card | 3.0 | 75.16 | toys | 50.0 | 4352.0 |
| 15 | 34513ce0c4fab462a55830c0989c7edb | 782987b81c92239d922aa49d6bd4200b | delivered | 7/13/2017 19:58 | 7/13/2017 20:10 | 7/19/2017 14:04 | 8/8/2017 0:00 | 4278 | sao paulo | SP | ... | f7e0fa615b386bc9a8b9eb52bc1fff76 | 87142160b41353c4e5fca2360caf6f92 | 98.00 | 16.13 | credit_card | 1.0 | 114.13 | toys | 325.0 | 9261.0 |
| 16 | 82566a660a982b15fb86e904c8d32918 | e97109680b052ee858d93a539597bba7 | delivered | 6/7/2018 10:06 | 6/9/2018 3:13 | 6/19/2018 12:05 | 7/18/2018 0:00 | 35400 | ouro preto | MG | ... | 72a97c271b2e429974398f46b93ae530 | 094ced053e257ae8cae57205592d6712 | 31.90 | 18.23 | wallet | 1.0 | 50.13 | toys | 450.0 | 4352.0 |
10 rows × 21 columns
Per-customer repeat-order count, total product spend, average product spend, total payment, and average payment¶
merged_df_cleaned['repeat_order'] = merged_df_cleaned.groupby('customer_id')['order_id'].transform('nunique')
merged_df_cleaned['total_price'] = merged_df_cleaned.groupby('customer_id')['price'].transform('sum')
merged_df_cleaned['avg_price'] = merged_df_cleaned.groupby('customer_id')['price'].transform('mean')
merged_df_cleaned['total_payment_value'] = merged_df_cleaned.groupby('customer_id')['payment_value'].transform('sum')
merged_df_cleaned['avg_payment_value'] = merged_df_cleaned.groupby('customer_id')['payment_value'].transform('mean')
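The per-customer features above rely on groupby().transform(), which broadcasts each group statistic back onto every row. A small sketch with a hypothetical two-customer frame:

```python
import pandas as pd

# Hypothetical mini-frame: customer 'a' has two orders, 'b' has one
df = pd.DataFrame({
    'customer_id': ['a', 'a', 'b'],
    'order_id':    ['o1', 'o2', 'o3'],
    'price':       [10.0, 20.0, 5.0],
})

# transform() returns one value per ROW (the group statistic repeated),
# unlike agg(), which would return one row per customer
df['repeat_order'] = df.groupby('customer_id')['order_id'].transform('nunique')
df['total_price'] = df.groupby('customer_id')['price'].transform('sum')
print(df['repeat_order'].tolist())  # [2, 2, 1]
print(df['total_price'].tolist())   # [30.0, 30.0, 5.0]
```

This row-aligned behavior is what lets the new columns be assigned straight back into merged_df_cleaned.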
# Redefine the numerical and categorical columns
numerical_cols = ['price', 'shipping_charges', 'payment_value', 'product_weight_g', 'volume']
categorical_cols = ['order_status', 'customer_zip_code_prefix', 'customer_city', 'customer_state',
                    'order_item_id', 'payment_type', 'payment_installments', 'product_category_name']
# Log-transform the numerical columns
def log_transform(df, cols):
    df_log_transformed = df[cols].apply(lambda x: np.log1p(x))
    return df_log_transformed
# Min-Max normalize the numerical columns
def normalize(df, cols):
    scaler = MinMaxScaler()
    df_normalized = pd.DataFrame(scaler.fit_transform(df[cols]), columns=cols, index=df.index)
    return df_normalized
# Log transform
log_transformed_df = log_transform(merged_df_cleaned, numerical_cols)
# Normalize
normalized_df = normalize(log_transformed_df, numerical_cols)
# Write the result back into the main DataFrame
merged_df_cleaned[numerical_cols] = normalized_df
# KDE plots of the numerical columns after scaling
plt.figure(figsize=(20, 10))
for idx, col in enumerate(numerical_cols):
    plt.subplot(3, 4, idx+1)
    sns.kdeplot(merged_df_cleaned[col], fill=True)  # `shade` is deprecated; use `fill`
    plt.title(f'KDE Plot of {col}')
plt.tight_layout()
plt.show()
3-4. Encoding categorical variables¶
One-Hot Encoding: represents each category as a vector of 0s and 1s. One column is created per category, so each category becomes a new variable. Drawback: for variables with many categories this increases the data's cardinality and can degrade model performance.
Label Encoding: maps n categories to the consecutive integers 0 to n-1. Use with care when a variable has three or more categories, since the integers imply an order. It also encodes only one column per call.
Our data has many categorical variables, and product_category_name alone has about 70 categories, so both one-hot and label encoding have limitations here. The encoding choice therefore has to account for two issues:
- Cardinality: with one-hot encoding, more categories mean more dimensions, sparser data, higher computation cost, and potentially worse model performance.
- Distance distortion: label encoding turns categories into ordered numbers, so distance-based models such as KNN see artificial distances between the encoded values, which can produce misleading results.
Our choice: Label Encoding
- There are many categorical values, but we will first reduce the number of categories and then label-encode.
Additionally:
- The location columns customer_zip_code_prefix, customer_city, and customer_state are handled separately by grouping them into five regions.
- The roughly 70 product categories are likewise consolidated into broader groups.
- The date columns order_purchase_timestamp, order_approved_at, order_delivered_timestamp, and order_estimated_delivery_date are converted to datetime dtype for easier downstream analysis.
- order_status and payment_type are label-encoded.
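The cardinality and distance issues above can be seen on a tiny hypothetical column (the values here are illustrative, mirroring payment_type):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column to contrast the two encodings
df = pd.DataFrame({'payment_type': ['credit_card', 'voucher', 'wallet', 'credit_card']})

# One-hot: one new column per category — dimensionality grows with cardinality
one_hot = pd.get_dummies(df['payment_type'], prefix='payment_type')
print(one_hot.shape)  # (4, 3): 4 rows, one column per distinct category

# Label encoding: a single integer column, but the integers imply an
# artificial order (alphabetical: credit_card=0, voucher=1, wallet=2)
le = LabelEncoder()
labels = le.fit_transform(df['payment_type'])
print(labels)  # [0 1 2 0]
```

With 70 categories, one-hot would add 70 sparse columns, while label encoding keeps one column at the cost of the artificial ordering noted above.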
customer_state, customer_zip_code_prefix, customer_city¶
# Group the states into Brazil's five regions
state_to_region = {
'AC': '북부 지방', 'AL': '북동부 지방', 'AP': '북부 지방', 'AM': '북부 지방',
'RR': '북부 지방', 'RO': '북부 지방', 'PA': '북부 지방', 'PB': '북동부 지방',
'MA': '북동부 지방', 'PI': '북동부 지방', 'PE': '북동부 지방', 'RN': '북동부 지방',
'CE': '북동부 지방', 'SE': '북동부 지방', 'BA': '북동부 지방', 'DF': '중서부 지방',
'TO': '북부 지방', 'GO': '중서부 지방', 'MS': '중서부 지방', 'MT': '중서부 지방',
'RJ': '남동부 지방', 'SP': '남동부 지방', 'MG': '남동부 지방', 'ES': '남동부 지방',
'RS': '남부 지방', 'SC': '남부 지방', 'PR': '남부 지방'
}
# Map the state column to its region
merged_df_cleaned['region'] = merged_df_cleaned['customer_state'].map(state_to_region)
# Drop customer_zip_code_prefix, customer_city, customer_state
columns_to_drop = ['customer_zip_code_prefix', 'customer_city', 'customer_state']
merged_df_cleaned = merged_df_cleaned.drop(columns=columns_to_drop)
# Check the result
merged_df_cleaned['region'].value_counts()
region
남동부 지방    55872
남부 지방     11141
북동부 지방     5471
중서부 지방     4463
북부 지방       924
Name: count, dtype: int64
region mapping:
- 남동부 지방 (Southeast): 0
- 남부 지방 (South): 1
- 북동부 지방 (Northeast): 2
- 북부 지방 (North): 3
- 중서부 지방 (Central-West): 4
# Mapping confirmed — now label-encode
label_encoder = LabelEncoder()
# Label-encode the 'region' column
merged_df_cleaned['region'] = label_encoder.fit_transform(merged_df_cleaned['region'])
merged_df_cleaned['region'].value_counts()
region
0    55872
1    11141
2     5471
4     4463
3      924
Name: count, dtype: int64
product_category_name¶
# Consolidate product_category_name into broader groups
category_mapping = {
'가구/인테리어': [
'furniture_decor', 'furniture_living_room', 'furniture_bedroom',
'furniture_mattress_and_upholstery', 'kitchen_dining_laundry_garden_furniture',
'la_cuisine', 'flowers', 'cool_stuff', 'perfumery', 'party_supplies',
'bed_bath_table', 'market_place', 'home_construction', 'christmas_supplies'
],
'패션': [
'fashion_underwear_beach', 'fashion_bags_accessories', 'fashion_shoes',
'fashion_male_clothing', 'fashion_sport', 'fashion_childrens_clothes',
'fashio_female_clothing', 'housewares', 'watches_gifts'
],
'전자제품': [
'telephony', 'computers_accessories', 'audio', 'tablets_printing_image',
'cine_photo', 'musical_instruments', 'consoles_games', 'dvds_blu_ray',
'music', 'electronics', 'air_conditioning', 'small_appliances',
'home_appliances', 'home_appliances_2', 'small_appliances_home_oven_and_coffee',
'home_comfort_2', 'signaling_and_security', 'security_and_services',
'fixed_telephony'
],
'건설/공구': [
'construction_tools_construction', 'costruction_tools_garden',
'construction_tools_safety', 'construction_tools_lights',
'costruction_tools_tools', 'garden_tools'
],
'생활용품': [
'baby', 'diapers_and_hygiene', 'health_beauty', 'home_confort',
'luggage_accessories', 'auto', 'food', 'drinks', 'food_drink',
'sports_leisure', 'pet_shop', 'agro_industry_and_commerce'
],
'문구/사무용품': [
'stationery', 'office_furniture', 'books_technical',
'books_general_interest', 'books_imported', 'arts_and_craftmanship',
'art', 'industry_commerce_and_business'
],
'장난감': ['toys']
}
# Map a raw category name to its broader group
def map_category(category_name):
    for main_category, subcategories in category_mapping.items():
        if category_name in subcategories:
            return main_category
    return '기타'  # "other"
# Map the `product_category_name` column into a new column
merged_df_cleaned['product_category_group'] = merged_df_cleaned['product_category_name'].apply(map_category)
# Drop the original 'product_category_name' column
merged_df_cleaned = merged_df_cleaned.drop(columns='product_category_name')
merged_df_cleaned['product_category_group'].value_counts()
product_category_group
장난감        58189
생활용품        6108
가구/인테리어     4968
전자제품        4164
패션          2730
건설/공구       1107
문구/사무용품      605
Name: count, dtype: int64
product_category_group encoding:
- 가구/인테리어 (furniture/interior): 0
- 건설/공구 (construction/tools): 1
- 문구/사무용품 (stationery/office supplies): 2
- 생활용품 (household goods): 3
- 장난감 (toys): 4
- 전자제품 (electronics): 5
- 패션 (fashion): 6
# product_category_name is mapped correctly — now label-encode
label_encoder = LabelEncoder()
# Label-encode the 'product_category_group' column
merged_df_cleaned['product_category_group'] = label_encoder.fit_transform(merged_df_cleaned['product_category_group'])
# Check the transformed data
merged_df_cleaned['product_category_group'].value_counts()
product_category_group
4    58189
3     6108
0     4968
5     4164
6     2730
1     1107
2      605
Name: count, dtype: int64
order_purchase_timestamp, order_approved_at, order_delivered_timestamp, order_estimated_delivery_date¶
# Convert the date columns to datetime dtype before processing the categorical variables
date_columns = ['order_purchase_timestamp', 'order_approved_at',
'order_delivered_timestamp', 'order_estimated_delivery_date']
for col in date_columns:
merged_df_cleaned[col] = pd.to_datetime(merged_df_cleaned[col])
merged_df_cleaned[date_columns].dtypes
order_purchase_timestamp         datetime64[ns]
order_approved_at                datetime64[ns]
order_delivered_timestamp        datetime64[ns]
order_estimated_delivery_date    datetime64[ns]
dtype: object
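One reason the datetime dtype makes later analysis easier is that timestamp arithmetic becomes trivial. A hedged sketch on a hypothetical two-row frame (the `delivery_days` feature is illustrative, not part of the pipeline above):

```python
import pandas as pd

# Hypothetical mini-frame with the same timestamp format as the dataset
orders = pd.DataFrame({
    'order_purchase_timestamp':  ['10/2/2017 10:56', '7/24/2018 20:41'],
    'order_delivered_timestamp': ['10/10/2017 21:25', '8/7/2018 15:27'],
})
for col in orders.columns:
    # Explicit format matching the sample strings (month/day/year hour:minute)
    orders[col] = pd.to_datetime(orders[col], format='%m/%d/%Y %H:%M')

# Once the dtype is datetime64, subtraction yields a Timedelta,
# which can become a numeric "days to deliver" feature
orders['delivery_days'] = (
    orders['order_delivered_timestamp'] - orders['order_purchase_timestamp']
).dt.days
print(orders['delivery_days'].tolist())  # [8, 13]
```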
order_status, payment_type¶
'delivered': 0, 'shipped': 1, 'canceled': 2, 'unavailable': 3, 'processing': 4, 'invoiced': 5, 'created': 6, 'approved': 7
'credit_card': 0, 'wallet': 1, 'voucher': 2, 'debit_card': 3, 'not_defined': 4
# Encode order_status and payment_type
# Define the mapping dictionaries
order_status_mapping = {
'delivered': 0,
'shipped': 1,
'canceled': 2,
'unavailable': 3,
'processing': 4,
'invoiced': 5,
'created': 6,
'approved': 7
}
payment_type_mapping = {
'credit_card': 0,
'wallet': 1,
'voucher': 2,
'debit_card': 3,
'not_defined': 4
}
# Apply the label encodings directly via map()
merged_df_cleaned['order_status_encoded'] = merged_df_cleaned['order_status'].map(order_status_mapping)
merged_df_cleaned['payment_type_encoded'] = merged_df_cleaned['payment_type'].map(payment_type_mapping)
# Drop the original columns (optional)
merged_df_cleaned = merged_df_cleaned.drop(columns=['order_status', 'payment_type'])
merged_df_cleaned.iloc[:, :10].head(3)
| order_id | customer_id | order_purchase_timestamp | order_approved_at | order_delivered_timestamp | order_estimated_delivery_date | order_item_id | product_id | seller_id | price | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 7c396fd4830fd04220f754e42b4e5bff | 2017-10-02 10:56:00 | 2017-10-02 11:07:00 | 2017-10-10 21:25:00 | 2017-10-18 | 1 | 87285b34884572647811a353c7ac498a | 3504c0cb71d7fa48d967e0e4c94d59d9 | 0.562297 |
| 1 | e481f51cbdc54678b7cc49136f2d6af7 | 7c396fd4830fd04220f754e42b4e5bff | 2017-10-02 10:56:00 | 2017-10-02 11:07:00 | 2017-10-10 21:25:00 | 2017-10-18 | 1 | 87285b34884572647811a353c7ac498a | 3504c0cb71d7fa48d967e0e4c94d59d9 | 0.562297 |
| 2 | e481f51cbdc54678b7cc49136f2d6af7 | 7c396fd4830fd04220f754e42b4e5bff | 2017-10-02 10:56:00 | 2017-10-02 11:07:00 | 2017-10-10 21:25:00 | 2017-10-18 | 1 | 87285b34884572647811a353c7ac498a | 3504c0cb71d7fa48d967e0e4c94d59d9 | 0.562297 |
merged_df_cleaned.iloc[:, 12:].head(3)
| payment_value | product_weight_g | volume | repeat_order | total_price | avg_price | total_payment_value | avg_payment_value | region | product_category_group | order_status_encoded | payment_type_encoded | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.516399 | 0.772611 | 0.418826 | 2 | 125.36 | 31.34 | 82.82 | 20.705 | 0 | 6 | 0 | 0 |
| 1 | 0.192265 | 0.772611 | 0.418826 | 2 | 125.36 | 31.34 | 82.82 | 20.705 | 0 | 6 | 0 | 2 |
| 2 | 0.520649 | 0.772611 | 0.418826 | 2 | 125.36 | 31.34 | 82.82 | 20.705 | 0 | 6 | 0 | 2 |
merged_df_cleaned.to_csv('merged_df_pre_processed.csv', index=False)
merged_df_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
Index: 77871 entries, 0 to 119159
Data columns (total 24 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   order_id                       77871 non-null  object
 1   customer_id                    77871 non-null  object
 2   order_purchase_timestamp       77871 non-null  datetime64[ns]
 3   order_approved_at              77871 non-null  datetime64[ns]
 4   order_delivered_timestamp      76297 non-null  datetime64[ns]
 5   order_estimated_delivery_date  77871 non-null  datetime64[ns]
 6   order_item_id                  77871 non-null  int64
 7   product_id                     77871 non-null  object
 8   seller_id                      77871 non-null  object
 9   price                          77871 non-null  float64
 10  shipping_charges               77871 non-null  float64
 11  payment_installments           77871 non-null  float64
 12  payment_value                  77871 non-null  float64
 13  product_weight_g               77871 non-null  float64
 14  volume                         77871 non-null  float64
 15  repeat_order                   77871 non-null  int64
 16  total_price                    77871 non-null  float64
 17  avg_price                      77871 non-null  float64
 18  total_payment_value            77871 non-null  float64
 19  avg_payment_value              77871 non-null  float64
 20  region                         77871 non-null  int32
 21  product_category_group         77871 non-null  int32
 22  order_status_encoded           77871 non-null  int64
 23  payment_type_encoded           77871 non-null  int64
dtypes: datetime64[ns](4), float64(10), int32(2), int64(4), object(4)
memory usage: 14.3+ MB
4. Running the clustering models¶
# .copy() avoids the SettingWithCopyWarning when label columns are added below
cluster_1 = merged_df_cleaned[['repeat_order', 'total_price', 'avg_price', 'total_payment_value', 'avg_payment_value']].copy()
from sklearn.cluster import KMeans
# Fit on the feature columns only, so the label columns added below
# are not accidentally fed back in as features
feature_cols = ['repeat_order', 'total_price', 'avg_price', 'total_payment_value', 'avg_payment_value']
# Split into 2 clusters
custoemr_kmeans2 = KMeans(n_clusters=2, init='k-means++', max_iter=300, random_state=42)
custoemr_kmeans2.fit(cluster_1[feature_cols])
cluster_1['cluster2'] = custoemr_kmeans2.labels_
# Split into 3 clusters
custoemr_kmeans3 = KMeans(n_clusters=3, init='k-means++', max_iter=300, random_state=42)
custoemr_kmeans3.fit(cluster_1[feature_cols])
cluster_1['cluster3'] = custoemr_kmeans3.labels_
# Split into 4 clusters
custoemr_kmeans4 = KMeans(n_clusters=4, init='k-means++', max_iter=300, random_state=42)
custoemr_kmeans4.fit(cluster_1[feature_cols])
cluster_1['cluster4'] = custoemr_kmeans4.labels_
# Silhouette coefficient for each k, scored on the feature columns only
# (refitting on the frame with the label columns included would leak them in)
from sklearn.metrics import silhouette_score
feature_cols = ['repeat_order', 'total_price', 'avg_price', 'total_payment_value', 'avg_payment_value']
silhouette_1 = silhouette_score(cluster_1[feature_cols], custoemr_kmeans2.labels_)
print(f'Silhouette Score with 2 clusters = {silhouette_1:.6f}')
silhouette_2 = silhouette_score(cluster_1[feature_cols], custoemr_kmeans3.labels_)
print(f'Silhouette Score with 3 clusters = {silhouette_2:.6f}')
silhouette_3 = silhouette_score(cluster_1[feature_cols], custoemr_kmeans4.labels_)
print(f'Silhouette Score with 4 clusters = {silhouette_3:.6f}')
Silhouette Score with 2 clusters = 0.765187
Silhouette Score with 3 clusters = 0.575237
Silhouette Score with 4 clusters = 0.559589
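The pattern above (fit K-Means for several k, score each with the silhouette coefficient, pick the best) can be sketched compactly. This is a minimal illustration on synthetic stand-in data, not the project dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with 3 well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    labels = km.fit_predict(X)
    # Score each k on the same feature matrix with the labels just produced
    scores[k] = silhouette_score(X, labels)

# Silhouette ranges over [-1, 1]; higher means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print(best_k)  # 3 for this synthetic data
```

Passing `n_init` explicitly keeps the behavior stable across scikit-learn versions, where its default has changed.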
wcss_2 = custoemr_kmeans2.inertia_
print(f'WCSS with 2 clusters = {wcss_2:.4f}')
wcss_3 = custoemr_kmeans3.inertia_
print(f'WCSS with 3 clusters = {wcss_3:.4f}')
wcss_4 = custoemr_kmeans4.inertia_
print(f'WCSS with 4 clusters = {wcss_4:.4f}')
WCSS with 2 clusters = 3528533448.0313
WCSS with 3 clusters = 2497989545.3095
WCSS with 4 clusters = 1931635348.2375
plt.figure(figsize=(30,10))
plt.subplot(1,3,1)
sns.scatterplot(data=cluster_1, x='total_price', y='total_payment_value', hue='cluster2')
plt.subplot(1,3,2)
sns.scatterplot(data=cluster_1, x='total_price', y='total_payment_value', hue='cluster3')
plt.subplot(1,3,3)
sns.scatterplot(data=cluster_1, x='total_price', y='total_payment_value', hue='cluster4')
<Axes: xlabel='total_price', ylabel='total_payment_value'>
4-1. Checking the elbow point¶
def elbow(df):
    sse = []
    for i in range(1, 15):
        km = KMeans(n_clusters=i, init='k-means++', random_state=42)
        km.fit(df)
        sse.append(km.inertia_)
    plt.plot(range(1, 15), sse, marker='o')
    plt.xlabel('cluster count')
    plt.ylabel('SSE')
    plt.show()
elbow(cluster_1)
4-2. Checking correlations between columns¶
import seaborn as sns
import matplotlib.pyplot as plt
# Compute the correlation matrix between the columns
correlation_matrix = cluster_1.corr()
# Draw a heatmap from the correlation matrix
sns.heatmap(data=correlation_matrix, annot=True, cmap='RdYlGn')
plt.show()
5. Interpreting the clustering results¶
5-1. Cluster 1¶
# Copy so the original frame is preserved
merged_clu_df = merged_df_cleaned.copy()
# Columns used for clustering
cluster_1 = merged_df_cleaned[['repeat_order', 'total_price', 'avg_price', 'total_payment_value', 'avg_payment_value']]
# Fit: split into 3 clusters
custoemr_kmeans3 = KMeans(n_clusters=3, init='k-means++', max_iter=300, random_state=42)
custoemr_kmeans3.fit(cluster_1)
# Store the cluster labels
merged_clu_df['cluster'] = custoemr_kmeans3.labels_
# Drop identifier and timestamp columns
def del_cols(df):
    del df['order_id']
    del df['customer_id']
    del df['seller_id']
    del df['product_id']
    del df['order_approved_at']
    del df['order_delivered_timestamp']
    del df['order_estimated_delivery_date']
# Visualize the characteristics of a cluster
def character_visual(df):
    plt.figure(figsize=(20, 20))
    for i in range(len(df.columns)):
        cols = list(df.columns)[i]
        plt.subplot(4, 6, i + 1)
        # `palette` without `hue` only triggers a UserWarning, so it is omitted
        sns.histplot(df, x=cols)
        plt.title(cols)
# Select cluster 0 only; .copy() avoids mutating a view inside del_cols()
merged_clu1_df = merged_clu_df[merged_clu_df['cluster'] == 0].copy()
del_cols(merged_clu1_df)
character_visual(merged_clu1_df)
5-2. Cluster 2¶
merged_clu2_df = merged_clu_df[merged_clu_df['cluster'] == 1].copy()
del_cols(merged_clu2_df)
character_visual(merged_clu2_df)
5-2. Cluster 3¶
merged_clu3_df = merged_clu_df[merged_clu_df['cluster'] == 2].copy()  # .copy() avoids SettingWithCopyWarning when columns are dropped later
del_cols(merged_clu3_df)
character_visual(merged_clu3_df)
6. Additional Visualizations¶
import seaborn as sns
import matplotlib.pyplot as plt

merged_df = merged_clu_df
merged_df['order_purchase_timestamp'] = pd.to_datetime(merged_df['order_purchase_timestamp'])
merged_df['quarter'] = merged_df['order_purchase_timestamp'].dt.to_period('Q')

# Collect the unique quarters so every subplot shares the same x-axis labels
unique_quarters = sorted(merged_df['quarter'].unique())

# Distribution of quarters within each cluster
plt.figure(figsize=(12, 6))
for cluster in range(3):  # three clusters: 0, 1, 2
    plt.subplot(1, 3, cluster + 1)
    sns.countplot(data=merged_df[merged_df['cluster'] == cluster], x='quarter', palette='RdYlGn')
    plt.title(f'Cluster {cluster} Quarter Distribution')
    plt.xlabel('Quarter')
    plt.ylabel('Count')
    plt.xticks(ticks=range(len(unique_quarters)), labels=[str(q) for q in unique_quarters], rotation=45)  # shared x-axis labels
plt.tight_layout()
plt.show()
import pandas as pd

# Cluster counts per quarter
rows = []
for quarter in sorted(merged_df['quarter'].unique()):
    row = {
        'Quarter': quarter,
        'Cluster 0 Count': merged_df[(merged_df['quarter'] == quarter) & (merged_df['cluster'] == 0)].shape[0],
        'Cluster 1 Count': merged_df[(merged_df['quarter'] == quarter) & (merged_df['cluster'] == 1)].shape[0],
        'Cluster 2 Count': merged_df[(merged_df['quarter'] == quarter) & (merged_df['cluster'] == 2)].shape[0],
    }
    rows.append(row)

# Convert to a DataFrame
summary_df = pd.DataFrame(rows)
print(summary_df)
  Quarter  Cluster 0 Count  Cluster 1 Count  Cluster 2 Count
0  2016Q3                1                0                0
1  2016Q4              167                4               71
2  2017Q1             3094               90              809
3  2017Q2             5523              128             1616
4  2017Q3             7309              226             2355
5  2017Q4            10413              244             3522
6  2018Q1            12451              353             3980
7  2018Q2            11397              341             3560
8  2018Q3             7855              164             2198
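The per-quarter, per-cluster filtering loop above can be collapsed into a single `pd.crosstab` call, which also fills missing combinations with zero. A sketch on toy data (the column values are hypothetical):

```python
import pandas as pd

# Toy stand-in for merged_df with the two columns the loop uses (hypothetical values)
toy = pd.DataFrame({
    "quarter": ["2017Q1", "2017Q1", "2017Q1", "2017Q2", "2017Q2"],
    "cluster": [0, 1, 0, 2, 0],
})

# Rows: quarters, columns: cluster labels, cells: counts
summary = pd.crosstab(toy["quarter"], toy["cluster"])
print(summary)
```

On the real data, `pd.crosstab(merged_df['quarter'], merged_df['cluster'])` would reproduce the table above in one line.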
import pandas as pd
import matplotlib.pyplot as plt

# Build the summary DataFrame
summary_df = pd.DataFrame(rows)

# Cast the Quarter column to str so matplotlib treats it as a categorical axis
summary_df['Quarter'] = summary_df['Quarter'].astype(str)

# Plot quarterly order counts per cluster
plt.figure(figsize=(12, 6))
plt.plot(summary_df['Quarter'], summary_df['Cluster 0 Count'], marker='o', label='Cluster 0')
plt.plot(summary_df['Quarter'], summary_df['Cluster 1 Count'], marker='o', label='Cluster 1')
plt.plot(summary_df['Quarter'], summary_df['Cluster 2 Count'], marker='o', label='Cluster 2')
plt.xlabel('Quarter')
plt.ylabel('Count')
plt.title('Quarterly Distribution of Orders by Cluster')
plt.xticks(rotation=45)
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# 'AppleGothic' exists only on macOS; on Windows use 'Malgun Gothic' so Hangul glyphs render
plt.rcParams['font.family'] = 'Malgun Gothic'

# Average payment per day, using the merged dataset prepared above
daily_payment = merged_df.groupby(merged_df['order_purchase_timestamp'].dt.date)['payment_value'].mean()

plt.figure(figsize=(14, 7))
plt.plot(daily_payment.index, daily_payment.values, label='Average Daily Payment Value', color='skyblue')
plt.scatter(daily_payment.index, daily_payment.values, color='red', s=10, label='Payment Value Points')
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
plt.gcf().autofmt_xdate()
plt.title('Change in daily average payment value')
plt.xlabel('Date')
plt.ylabel('Average Payment Value')
plt.legend()
plt.grid(True)
plt.show()
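The series above is grouped per day even though the axis is labeled by month; aggregating to calendar months with `dt.to_period('M')` mirrors the `to_period('Q')` idiom used for the quarterly plots. A sketch on synthetic data (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Synthetic daily payments for Jan-Mar 2017 (hypothetical values)
rng = pd.date_range("2017-01-01", "2017-03-31", freq="D")
toy = pd.DataFrame({
    "order_purchase_timestamp": rng,
    "payment_value": np.linspace(100.0, 200.0, len(rng)),
})

# Group by calendar month (PeriodIndex) and average the payment values
monthly_payment = (
    toy.groupby(toy["order_purchase_timestamp"].dt.to_period("M"))["payment_value"].mean()
)
print(monthly_payment)
```

Plotting `monthly_payment` instead of the daily series would match a "monthly average" title and smooth out day-to-day noise.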